huggingface / swift-transformers

Swift Package to implement a transformers-like API in Swift
Apache License 2.0

Convert OpenELM to float16 Core ML #95

Open pcuenca opened 2 months ago

pcuenca commented 2 months ago

I converted the models to float32 using this script: https://gist.github.com/pcuenca/23cd08443460bc90854e2a6f0f575084, but found precision problems when targeting float16. It'd be interesting to see what float16 performance looks like, but we need to determine which layers/ops need to be kept in float32. Anyone interested, please let us know and we can work on it or test together :)
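For anyone who wants to reproduce the float32 baseline without reading the whole gist, a minimal sketch (the traced model handle and input spec are placeholders, not the exact ones from the script):

import numpy as np
import coremltools as ct

mlmodel_fp32 = ct.convert(
    traced_model,  # placeholder: a torch.jit.trace of the OpenELM wrapper
    inputs=[ct.TensorType(name="input_ids", shape=(1, 128), dtype=np.int32)],
    compute_precision=ct.precision.FLOAT32,
    convert_to="mlprogram",
)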

shavit commented 2 months ago

I'm interested.

0seba commented 2 months ago

Hey, I managed to track the issue to the Core ML converter pass pipelines. When using the following as the pass_pipeline parameter to the convert method, I observe in the MIL program that matmuls and other ops are scheduled to run in FP16, and a 20-30% speedup on an 8GB M1 Air compared to the reference FP32 Core ML model you uploaded, @pcuenca.

import coremltools as ct

pipeline = ct.PassPipeline.EMPTY
pipeline.append_pass("common::const_elimination")
pipeline.append_pass("common::add_fp16_cast")
pipeline.append_pass("common::dedup_op_and_var_names")

These are the minimal passes required for it to run. I do not see ANE usage when converting with ALL compute devices, but I do see medium-low usage when converting with CPU_AND_NE, though with a much lower inference time. This is related to the fact that without the pass-pipeline optimizations there are still a lot of additional operations that involve FP32 precision.
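For context, this is roughly how the pipeline gets passed to the converter (a sketch; traced_model and the input spec are placeholders, not the actual script):

import numpy as np
import coremltools as ct

mlmodel = ct.convert(
    traced_model,  # placeholder: the traced OpenELM model
    inputs=[ct.TensorType(name="input_ids", shape=(1, 128), dtype=np.int32)],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
    pass_pipeline=pipeline,
)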

I tried adding the cast_optimization pass, but it causes the model predictions to be erroneous again, so the issue is probably related to this optimization.

0seba commented 2 months ago

I think the main issue is with the RMSNorm; I had to do some hacky hacks to partially execute it in fp32.

I think the most accurate way to normalize in Core ML is using the MIL op l2_norm, which isn't accessible from PyTorch, so I patched the torch.acos operation (unused in the OpenELM graph) and told it to map to the Core ML l2_norm. The correct solution would have been to define a custom op in torch.library, but I couldn't manage to make that work on Mac. Additionally, my hacks only work for batch size 1, for the reason explained in the code comments.

import coremltools as ct
from coremltools.converters.mil import Builder as mb
from coremltools.converters.mil.frontend.torch.ops import _get_inputs
from coremltools.converters.mil.frontend.torch.torch_op_registry import (
    _TORCH_OPS_REGISTRY,
    register_torch_op,
)

# Remove the built-in mapping for torch.acos so we can repurpose it.
del _TORCH_OPS_REGISTRY["acos"]

eps = 1e-5

@register_torch_op
def acos(context, node):
    x, = _get_inputs(context, node, expected=1)
    x = mb.expand_dims(x=x, axes=[-1, -2])  # l2_norm works on the last 3 dimensions, so we have to expand 2 dims
    x = mb.l2_norm(x=x, epsilon=eps)
    x = mb.squeeze(x=x, axes=[-1, -2], name=node.name)
    context.add(x)

And I made a CustomRMSNorm that uses acos:

import torch
import torch.nn as nn

class CustomRMSNorm(nn.Module):
    def __init__(self, weight, eps):
        super().__init__()
        self.weight = weight
        self.hscale = weight.size(0) ** 0.5
        self.eps = eps

    def forward(self, x):
        # CoreML works with inputs up to 5 dimensions, so the queries and keys normalization would
        # fail because they have (batch, sequence, nheads, hdim) 4 dimensions, and we expand 2 additional dims
        # so we squeeze the batch dim and unsqueeze it after
        # THIS MEANS THAT THIS METHOD CURRENTLY WORKS WITH BATCH SIZE 1
        if len(x.size()) == 4:
            x = x.squeeze(0)
            unsqueeze = True
        else:
            unsqueeze = False
        x = x.acos()
        if unsqueeze:
            x = x.unsqueeze(0)
        return x * self.weight * x.size(-1) ** 0.5 # l2_norm does not perform scaling with the sqrt of dim (.pow().mean() in Pytorch), so we do it here

And we replace all the RMSNorm layers with our custom layer

model.transformer.norm = CustomRMSNorm(model.transformer.norm.weight, model.transformer.norm.eps)

for layer in model.transformer.layers:
    layer.attn.q_norm = CustomRMSNorm(layer.attn.q_norm.weight, layer.attn.q_norm.eps)
    layer.attn.k_norm = CustomRMSNorm(layer.attn.k_norm.weight, layer.attn.k_norm.eps)
    layer.ffn_norm = CustomRMSNorm(layer.ffn_norm.weight, layer.ffn_norm.eps)
    layer.attn_norm = CustomRMSNorm(layer.attn_norm.weight, layer.attn_norm.eps)

Finally, we perform the l2_norm operation in fp32

def selector(op):
    return op.op_type != "l2_norm"

compute_precision = ct.transform.FP16ComputePrecision(op_selector=selector)

coreml_model = ct.convert(
    ...,
    compute_precision=compute_precision,
)

This modification does not require modifying the pass_pipeline I mentioned in my previous reply.

These hacks provide a ~20-30% speedup over the fp32 Core ML model, while producing very similar outputs. From what I saw it still runs mainly on GPU with low ANE usage. It is possible to execute the l2_norm in fp16 (remove the op_selector), but the output difference increases. With that, the model runs mainly on ANE with higher usage and almost no GPU, but for some reason it is about 2x slower than the fp32 model; I think it may be related to the vast amount of tensor reshaping and concatenating operations.

Some other considerations on why I used l2_norm instead of executing the whole RMSNorm in fp32 with the original operations: the normalization uses the following operations: pow, reduce_mean, rsqrt, mul, add. With the op-precision selector we would have needed to execute those operations in fp32 for the whole model, defeating our goal of running the model in fp16. I think it is possible to select by operation name (Core ML assigns a different one to every individual operation), but this is more difficult because when converting from PyTorch the names are assigned automatically and change all the time. It is also possible to directly manipulate the generated MIL, but again this is more difficult and inconvenient.
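For illustration only, a hypothetical name-based selector would look something like this (the prefixes are made up; as noted, the auto-generated names change on every export, which is why I didn't go this route):

def selector(op):
    # Keep ops whose auto-generated MIL names match these (made-up) prefixes in fp32.
    keep_fp32_prefixes = ("transformer_norm", "q_norm", "k_norm")
    return not any(op.name.startswith(p) for p in keep_fp32_prefixes)

compute_precision = ct.transform.FP16ComputePrecision(op_selector=selector)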

I think the best solution is to implement the model directly in MIL ops. I've been working on this for the past week and already have a functioning GPT-2 model; hopefully I'll have the OpenELM implementation sometime next week.

0seba commented 2 months ago

I think this is my last update regarding this. The activations' norms were enormous, on the order of hundreds of thousands, impossible for fp16 to handle. I modified the RMSNorm to use a stable version of normalization; here is the code. This allows the model to run completely in fp16, with about a 2x speedup versus the fp32 reference. I also replaced the direct l2_norm normalization with an extended version that calculates the norm first and performs the division in a separate operation. Since this eliminates the need for expands and squeezes, it also removes the restriction to batch size 1 (aside from the repeat_interleave; I doubt I'll look into that).

# Imports as in the earlier snippet.
del _TORCH_OPS_REGISTRY["acos"]

eps = 1e-6

@register_torch_op
def acos(context, node):
    x, = _get_inputs(context, node, expected=1)
    # Returns the L2 norm along the last dim (not the normalized tensor).
    x = mb.reduce_l2_norm(x=x, axes=[-1], keep_dims=True, name=node.name)
    context.add(x)

def stable_l2_norm(x, eps):
    max_val = x.abs().max(axis=-1, keepdim=True).values
    max_val = torch.clamp(max_val, min=eps)
    xscaled = x / max_val
    scaled_norm = torch.acos(xscaled)  # lowered to mb.reduce_l2_norm by the patched acos op above
    return x / torch.clamp(scaled_norm, min=eps), max_val

class CustomRMSNorm(nn.Module):
    def __init__(self, weight, eps):
        super().__init__()
        self.weight = weight
        self.eps = eps

    def forward(self, x):
        x, max_val = stable_l2_norm(x, self.eps)
        return x * (x.size(-1) ** 0.5 / max_val) * self.weight

With this modification, for a single batch of length 128 with random input ids, the maximum relative error is 6.51% and the average error is 0.33%. Error percentiles: 99%: 1.2% relative error, 99.5%: 1.39%, 99.9%: 1.83%, 99.99%: 2.6%. Histogram of the error distribution:

[histogram: relative_error]

Same with cumulative percent in log scale: [relative_error_log]

I thought the RoPE embeddings might be another source of error, since those are also cast to float32 in the OpenELM PyTorch code, but I couldn't find a big effect on the error from that.

smpanaro commented 1 month ago

You don't need to use the acos workaround. coremltools will convert scaled_norm = torch.linalg.norm(xscaled, dim=-1, keepdim=True) to mb.reduce_l2_norm.
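For example, the earlier stable_l2_norm could then be written without the acos patch; a sketch along those lines (not tested here):

import torch

def stable_l2_norm(x, eps):
    max_val = x.abs().max(dim=-1, keepdim=True).values.clamp(min=eps)
    xscaled = x / max_val
    # Lowered to mb.reduce_l2_norm by coremltools, no custom op registration needed.
    scaled_norm = torch.linalg.norm(xscaled, dim=-1, keepdim=True)
    return x / scaled_norm.clamp(min=eps), max_val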

Also, it's interesting that you need to clamp. Llama 2 exhibits similarly large activation outliers (as do other models, per this paper), but I didn't need to clamp for it to work. I am doing dim=1 (different tensor layout), so possibly that is why.

0seba commented 1 month ago

Thanks for the norm op. The clamp is only applied to the min value to prevent division by zero; I just like it better than adding eps, but I doubt it makes any difference.

antmikinka commented 1 month ago

I tried @0seba's code and removed @smpanaro's suggestion. After converting, I got a max diff of 52.12176513671875 for random inputs. I uploaded the script and .mlpackage below. Everything runs on the ANE besides 5 ops, on an M1 8GB MacBook Pro.

[Screenshot 2024-05-13 at 1:38:31 AM]

https://github.com/antmikinka/swift-transformers-test

pcuenca commented 1 month ago

Thanks all for the great comments and analysis!

Turns out the error on random inputs happens because inference is running on CPU for validation. When you run on GPU, generations are actually fine! I haven't tested the Neural Engine yet, will do and report back.

It's interesting that some op is not working properly on CPU when using half precision; I don't think I've seen that before in other models. Worth diving deeper, in my opinion :)
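For reference, this is roughly how to pin validation to a specific compute unit when loading the converted package (path and input name are placeholders):

import coremltools as ct

model_gpu = ct.models.MLModel(
    "OpenELM.mlpackage",  # placeholder path
    compute_units=ct.ComputeUnit.CPU_AND_GPU,
)
outputs = model_gpu.predict({"input_ids": input_ids})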

0seba commented 1 month ago

It seems that mb.reduce_l2_norm does not run on the NPU, but combining reduce_sum_square and rsqrt does. Also, I'm having issues with the prediction head: I split the 32k vocab size into matrices of 4k tokens and apply the matmuls with the last hidden states separately. When I try to concat the predictions, it supports only up to a size of around 16k (I only tried multiples of 4k); trying with 20k fails and falls back to CPU, which is much slower. I'm running on an M1 Air for reference.
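A sketch of what that composition might look like in place of the reduce_l2_norm registration (my assumption, mirroring the earlier snippet, not actual tested code); note that it returns the already-normalized tensor, so the torch-side helper would no longer divide by the result:

@register_torch_op
def acos(context, node):
    x, = _get_inputs(context, node, expected=1)
    ss = mb.reduce_sum_square(x=x, axes=[-1], keep_dims=True)  # sum(x^2) over the last dim
    inv_norm = mb.rsqrt(x=ss, epsilon=eps)                     # 1 / sqrt(sum(x^2) + eps)
    out = mb.mul(x=x, y=inv_norm, name=node.name)              # x / ||x||
    context.add(out)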

antmikinka commented 1 month ago

Made Optimization Guidelines for the Apple Neural Engine.txt for targeting the ANE based on:

Also uploaded a palettized model anthonymikinka/OpenELM-1_1B-Instruct-128-FP16ComputePrecision-Palettized-Kmeans-4bits

I am unable to run the performance report due to the RAM issue. I wasn't sure how to test this; I tried using the Swift-Chat Xcode project, but it wasn't working.

Hopefully this helps @0seba

pcuenca commented 1 month ago

Very nice summary and great collection of resources @antmikinka! 🙌

The model you linked produces 16-bit outputs, which are still a bit less compatible than float32. Could you export it so that the output is float32, and I can try to run it with swift-transformers? You'd do it by specifying the output dtype like this:

coreml_output_types = [ct.TensorType(name=name, dtype=np.float32) for name in outputs.keys()]
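Then those types go into the convert call, roughly like this (the other arguments here are placeholders):

import numpy as np
import coremltools as ct

coreml_model = ct.convert(
    traced_model,                # placeholder: the traced model from your script
    inputs=coreml_input_types,   # placeholder: whatever input spec the script already builds
    outputs=coreml_output_types,
    convert_to="mlprogram",
)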
pcuenca commented 1 month ago

This is your model running on Xcode @antmikinka:

[Screenshot 2024-05-16 at 16:58:57]
antmikinka commented 1 month ago

@pcuenca I'm glad that the resources are helpful. Here are some links to the new models with that requested change. I didn't change anything else besides that one piece of code in my script.

Original Model: anthonymikinka/OpenELM-1_1B-Instruct-128-FP16ComputePrecision_v2
Palettized 4Bit Model: anthonymikinka/OpenELM-1_1B-Instruct-128-FP16ComputePrecision_v2-Palettized-Kmeans-4bits
Palettized 6Bit Model: anthonymikinka/OpenELM-1_1B-Instruct-128-FP16ComputePrecision_v2-Palettized-Kmeans-6Bits

SpiraMira commented 1 month ago

@antmikinka - for reference: your v2 6Bits model running on a 2021 M1 Pro, 10-core, 16GB.


Any news on ANE? I don't have enough memory to run a performance test on my machine.

antmikinka commented 1 month ago

If I am not mistaken, the repeat_interleave op is one op we're deleting. This op also deals with the KV cache. apple/ml-recurrent-drafter was just updated last week with 5499 file additions, a number of them copyrighted from 2020. Maybe we can take some of this information and apply it to swift-transformers?

ANE optimization:

ml-recurrent-drafter repo incorporates ANE Principles with an example of llama:

Here are some of the files:

antmikinka commented 1 month ago

Updated antmikinka/swift-transformers-test yesterday. Check it out for more information.

I have two different viewing methods for the models. Here are the model layers, ops, and precision.

CoreMLInspect Models - Layers, OPs, & Precision

layer-iteration.py Model - Layers, OPs, & Precision

model chunking

I have chunked the OpenELM-270M-Instruct. I am unsure how it performs; I need to update my chunk_mlprogram.py to calculate the PSNR. I noticed that after uploading to HF, the mlpackage went from 625MB to 482MB in size. Chunking did change what ops run where, moving some to other compute units.
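For what it's worth, a minimal sketch of the kind of PSNR helper chunk_mlprogram.py would need (generic numpy code, not the actual script):

import numpy as np

def compute_psnr(reference, test, eps=1e-10):
    # PSNR in dB between the original model's output and the chunked pipeline's output.
    reference = np.asarray(reference, dtype=np.float64).ravel()
    test = np.asarray(test, dtype=np.float64).ravel()
    rmse = np.sqrt(np.mean((reference - test) ** 2))
    peak = np.max(np.abs(reference))
    return 20.0 * np.log10((peak + eps) / (rmse + eps))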

antmikinka commented 3 weeks ago

Apple is landing a lot of work in coremltools, probably to showcase the upcoming WWDC 2024 Day 2 event.

Stateful Model Applications

Using state input types can be convenient for working with models that require storing some intermediate values, updating them and then reusing them in subsequent predictions to avoid extra computations. One such example of a model is a language model (LM) that uses the transformer architecture and attention blocks. An LM typically works by digesting sequences of input data and producing output tokens in an auto-regressive manner: that is, producing one output token at a time, updating some internal state in the process, using that token and updated state to do the next prediction to produce the next output token, and so on.

In the case of a transformer, which involves three large tensors that the model processes: "Query", "Key", and "Value", a common optimization strategy is to avoid extra computations at token generation time by caching the "Key" and "Value" tensors and updating them incrementally to be reused in each iteration of processing new tokens. This optimization can be applied to Core ML models by making the Key-Values explicit inputs/outputs of the model. This is where State model types can also be utilized for more convenience and potential runtime performance improvements. For instance, please check out the 2024 WWDC session for an example that uses the Mistral 7B model and utilizes the stateful prediction feature for improved performance on a GPU on a MacBook Pro.
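As a toy illustration of the new stateful API (a sketch adapted from the pattern in the coremltools 8 documentation, not OpenELM-specific), an in-place-updated buffer can be exposed as Core ML state like this:

import numpy as np
import torch
import coremltools as ct

class UpdateBufferModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # The registered buffer becomes the Core ML state; its name must match the StateType below.
        self.register_buffer("accumulator", torch.zeros(3, dtype=torch.float16))

    def forward(self, x):
        self.accumulator.add_(x)  # in-place update of the state
        return self.accumulator * x

traced = torch.jit.trace(UpdateBufferModel().eval(), torch.zeros(3, dtype=torch.float16))

stateful_model = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(3,), dtype=np.float16, name="x")],
    states=[ct.StateType(wrapped_type=ct.TensorType(shape=(3,), dtype=np.float16), name="accumulator")],
    minimum_deployment_target=ct.target.iOS18,
)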

CoreMLTools 8.0b1 Release Link

Optimization Overview (CoreMLTools 8.0b1)
New optimization techniques for transformers & examples
Watch out for the below issue with this beta release.

Known Issues
- Conversion will fail when using certain palettization modes (e.g. int8 LUT, vector palettization) with torch models using ct.optimize.torch
- Some of the joint compression modes, when used with the training-time APIs in ct.optimize.torch, will result in a torch model that is not correctly converted
- The post-training palettization config for mlpackage models (ct.optimize.coreml.OpPalettizerConfig) does not yet have all the arguments supported in the cto.torch.palettization APIs (e.g. lut_dtype to get an int8-dtyped LUT, cluster_dim to do vector palettization, enable_per_channel_scale to apply per-channel scale, etc.)