unum-cloud / uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
https://unum-cloud.github.io/uform/
Apache License 2.0

CoreML FP16 model #50

Closed laclouis5 closed 11 months ago

laclouis5 commented 11 months ago

I was playing around with CoreML exports and I'm using the coco-sm tool to assess the performance. I benchmarked three configurations of the multilingual V2 model: the original PyTorch model, the CoreML FP32 export, and the CoreML FP16 export.

The CoreML FP32 model yields metrics very close to the original PyTorch model, which is fine. However, the CoreML FP16 model gives metrics close to zero for all languages.

It looks like the drop in performance is due to the text encoder only. I tried exporting the image encoder to FP16 while keeping the text encoder in FP32, and this gave performance on par with the FP32 model.

This needs more investigation, but it may be due to an overflow in some of the text encoder's weights during the FP16 conversion:

RuntimeWarning: overflow encountered in cast

Next, I'm going to try the same thing with the OpenCLIP model to see whether this issue affects only UForm or text encoders in general. This would be quite unfortunate, because the FP32 multilingual model is quite heavy (> 400 MB) and it would be great to store it in FP16 to reduce its size.
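For reference, the plain FP16 export that triggers the warning looks roughly like this (a sketch; `text_encoder_tr`, `sample_input_ids`, and `sample_attention_mask` are assumed to come from the usual torch.jit.trace export step):

import coremltools as ct

# A minimal sketch of the straight FP16 export that produces the overflow warning.
text_encoder_fp16 = ct.convert(
    text_encoder_tr,
    inputs=[
        ct.TensorType(
            name="input_ids",
            shape=sample_input_ids.shape,
            dtype=sample_input_ids.numpy().dtype,
        ),
        ct.TensorType(
            name="attention_mask",
            shape=sample_attention_mask.shape,
            dtype=sample_attention_mask.numpy().dtype,
        ),
    ],
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT16,  # the default for mlprogram, shown explicitly
)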

laclouis5 commented 11 months ago

I found some time to move forward on this issue.

I tried exporting the OpenCLIP model and faced the same overflow issue when converting to FP16, so I guess the issue is common to text encoders in general.

I investigated the affected layers and found that the overflowing weights/operations all sit in CoreML's mul operator. I was able to successfully convert a UForm model to FP16 by keeping this operation in FP32:

import coremltools as ct

text_encoder_ct = ct.convert(
    text_encoder_tr,  # the traced Torch text encoder
    inputs=[
        ct.TensorType(
            name="input_ids",
            shape=sample_input_ids.shape,
            dtype=sample_input_ids.numpy().dtype,
        ),
        ct.TensorType(
            name="attention_mask",
            shape=sample_attention_mask.shape,
            dtype=sample_attention_mask.numpy().dtype,
        ),
    ],
    outputs=[
        ct.TensorType(name="features"),
        ct.TensorType(name="embeddings"),
    ],
    convert_to="mlprogram",
    # Keep every mul op in FP32; all other ops are converted to FP16.
    compute_precision=ct.transform.FP16ComputePrecision(
        op_selector=lambda op: op.op_type != "mul"
    ),
)

The accuracy of the FP16 model is almost equal to that of the FP32 model on the coco-sm dataset. I was even able to palettize the weights to 8 bits without a significant drop in accuracy. The number of weights touched by this operator is very low (or even zero?), so the size on disk of the serialized model is almost halved compared to the FP32 model. I have not fully benchmarked the speed yet, but the inference speed seems sufficient. I'll profile it in an upcoming benchmark.
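For the 8-bit palettization, something along these lines works with coremltools 7+; the config below is a sketch, not necessarily the exact settings used for the numbers above:

import coremltools.optimize.coreml as cto

# Palettize the FP16 model's weights to an 8-bit k-means LUT. `text_encoder_ct` is the
# converted model from the snippet above; mode and nbits are illustrative choices.
op_config = cto.OpPalettizerConfig(mode="kmeans", nbits=8)
config = cto.OptimizationConfig(global_config=op_config)
text_encoder_8bit = cto.palettize_weights(text_encoder_ct, config=config)
text_encoder_8bit.save("TextEncoderPalettized.mlpackage")  # hypothetical output path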

ashvardanian commented 11 months ago

@laclouis5, amazing! Keep us updated here and/or on Discord 🤗

laclouis5 commented 11 months ago

So, I profiled the FP16 model using the Xcode profiler, and on my machine (M1 Pro) the inference takes 8 ms for the FP16 model vs 4 ms for the FP32 model. However, I noticed in the Instruments CoreML trace that there are a lot of transfers between the ANE and the GPU during a prediction step:

[Screenshot 2023-10-26: Instruments CoreML trace showing repeated ANE ↔ GPU transfers during a prediction]

I suspected that the slowdown of FP16 relative to FP32 might be caused by those transfers, and indeed, running inference on the CPU+GPU configuration brought back the original 4 ms!
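For anyone wanting to reproduce this, the compute units are restricted at load time; the .mlpackage path here is a placeholder:

import coremltools as ct

# Force CPU+GPU execution to avoid the ANE <-> GPU context transfers seen in the trace.
model = ct.models.MLModel(
    "TextEncoder.mlpackage",  # hypothetical path to the converted text encoder
    compute_units=ct.ComputeUnit.CPU_AND_GPU,
)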

ashvardanian commented 11 months ago

That's very impressive, @laclouis5! Thank you for sharing! The fastest we've got UForm running is 1.3 ms on recent Intel CPUs, but those are beefy server-grade chips. Looking forward to what the M3 will be capable of!

laclouis5 commented 11 months ago

I did some more investigation and now have it running in 1.07 ms on my M1 Pro 🥳.

I noticed that the FP16 overflow only occurs in an early layer of the model that deals with the attention mask, not in the mul layers deeper in the model. Thus, by keeping only that first mul operator in FP32, we avoid all the context transfers between the GPU and ANE caused by FP32 operations in the middle of the model.

The model now runs almost fully on the ANE:

[Screenshot 2023-10-27: Instruments CoreML trace showing the model running almost entirely on the ANE]

I ran the coco-sm benchmark and the accuracy seems to be on par with the FP32 model.

Here are the conversion settings I used:

# `text_batch_dim_shape` is the (possibly flexible) batch dimension; `input_ids` and
# `attention_mask` are sample tensors from the tokenizer.
text_encoder = ct.convert(
    text_encoder,  # the traced Torch text encoder
    convert_to="mlprogram",
    inputs=[
        ct.TensorType(
            name="input_ids",
            shape=(text_batch_dim_shape,) + input_ids.shape[1:],
            dtype=input_ids.numpy().dtype,
        ),
        ct.TensorType(
            name="attention_mask",
            shape=(text_batch_dim_shape,) + attention_mask.shape[1:],
            dtype=attention_mask.numpy().dtype,
        ),
    ],
    outputs=[ct.TensorType(name="features"), ct.TensorType(name="embeddings")],
    compute_precision=ct.transform.FP16ComputePrecision(
        # Keep only the early attention-mask mul op in FP32; everything else runs in FP16.
        op_selector=lambda op: op.name != "attn_mask.1"
    ),
)
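Saving the result and reloading it with all compute units enabled then lets CoreML schedule the graph onto the ANE (the path is a placeholder):

# Save the converted model and reload it with all compute units enabled so CoreML can
# schedule the (now almost fully FP16) graph onto the ANE.
text_encoder.save("TextEncoder.mlpackage")  # hypothetical path
text_encoder_ane = ct.models.MLModel(
    "TextEncoder.mlpackage",
    compute_units=ct.ComputeUnit.ALL,
)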

mixeden commented 11 months ago

omg wow

ashvardanian commented 11 months ago

We have even smaller models coming up in parallel with the M3 Macs, so get ready @laclouis5 and @mixeden 🤗

ratan commented 11 months ago

> That's very impressive, @laclouis5! Thank you for sharing! The fastest we've got UForm running is 1.3 ms on recent Intel CPUs, but those are beefy server-grade chips. Looking forward to what the M3 will be capable of!

That's good news. Did you use any different settings or arguments to run on the Intel CPUs?

ashvardanian commented 11 months ago

@ratan We've used ONNX and OpenVINO for inference.
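A rough sketch of that setup with ONNX Runtime's OpenVINO execution provider (the model path, input names, and sequence length here are placeholders, not the exact benchmark configuration):

import numpy as np
import onnxruntime as ort

# Run the text encoder through ONNX Runtime with the OpenVINO execution provider
# (requires the onnxruntime-openvino build; falls back to CPU otherwise).
session = ort.InferenceSession(
    "text_encoder.onnx",  # hypothetical export of the UForm text encoder
    providers=["OpenVINOExecutionProvider", "CPUExecutionProvider"],
)
input_ids = np.zeros((1, 77), dtype=np.int64)       # dummy tokenized batch
attention_mask = np.ones((1, 77), dtype=np.int64)
outputs = session.run(None, {"input_ids": input_ids, "attention_mask": attention_mask})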