huggingface / exporters

Export Hugging Face models to Core ML and TensorFlow Lite
Apache License 2.0

SegFormer model exported to CoreML is slow #56

Closed laclouis5 closed 9 months ago

laclouis5 commented 9 months ago

I was trying to export SegFormer models to Core ML, but the exported model is slow compared to the same model exported on my own.

I tried to export the model using the following command:

python -m exporters.coreml --model=nvidia/mit-b2 --feature=semantic-segmentation exports/

This model's median prediction time is 500 ms on my MacBook Pro M1 using all the available compute units (ANE, GPU, CPU), compared to the 300 ms of the same model exported on my own using coremltools directly.
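For reference, a latency measurement along these lines can be reproduced from Python with coremltools. This is only a sketch: the package path `exports/Model.mlpackage`, the input name `pixel_values`, and the input shape are assumptions, and if the export uses an image-typed input you would pass a PIL image instead of an array.

```python
import time
import numpy as np
import coremltools as ct

# Load the exported package with all compute units enabled (ANE, GPU, CPU).
model = ct.models.MLModel("exports/Model.mlpackage", compute_units=ct.ComputeUnit.ALL)

x = np.random.rand(1, 3, 512, 512).astype(np.float32)  # assumed input shape

timings = []
for _ in range(20):
    start = time.perf_counter()
    model.predict({"pixel_values": x})  # assumed input name
    timings.append(time.perf_counter() - start)

print(f"median prediction time: {np.median(timings) * 1000:.1f} ms")
```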

I did a little profiling with Xcode Instruments to identify the issue. It looks like the model is exported and executed in Float32. This greatly undermines performance, since Float16 data is required for the ANE to be used. As a result, the ANE is not used at all and the model runs on the GPU only on most devices. Also, Float32 computations are slower than Float16 computations on the GPU, so Float32 should be avoided when possible. In the coremltools documentation, Apple suggests using Float16 as the default, and as of version 7.0 Float16 is the default precision for Core ML exports.

With the option --quantize=float16 the inference time is on par with the model I exported myself (around 300 ms). I suggest using the coremltools default Float16 precision instead of Float32 in order to get the most out of the specialized hardware on Apple platforms.
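For comparison, this is roughly what the same precision choice looks like when calling coremltools directly. This is only a sketch, not the exact pipeline I used: the wrapper class and the 512x512 input shape are assumptions, and coremltools 7.0+ already uses Float16 by default so the argument is shown explicitly only for clarity.

```python
import coremltools as ct
import torch
from transformers import SegformerForSemanticSegmentation

# Wrap the model so the traced graph returns a plain tensor (the logits).
class SegformerWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, pixel_values):
        return self.model(pixel_values).logits

hf_model = SegformerForSemanticSegmentation.from_pretrained("nvidia/mit-b2").eval()
example = torch.rand(1, 3, 512, 512)
traced_model = torch.jit.trace(SegformerWrapper(hf_model), example)

mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="pixel_values", shape=example.shape)],
    compute_precision=ct.precision.FLOAT16,  # explicit here, but the default since coremltools 7.0
)
mlmodel.save("segformer_fp16.mlpackage")
```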

I also noted another issue, though one not related to the exporters framework. In Float16 and with the ANE, the Instruments trace suggests that half of the prediction time is spent in GPU kernels. That is surprising, since only one operator is executed on the GPU in this case: the argmax operation at the end of the model. This slowdown needs further investigation, but it may be due to the large size of the argmax input tensor (1000x512x512). I tried with only 16 output classes and the inference time dropped to 60 ms.

[Screenshot: Xcode Instruments trace, 2023-10-01]
pcuenca commented 9 months ago

Thanks a lot @laclouis5! Very interesting observations!

The philosophy we have been applying is to preserve the weights in their original precision when possible, but I tend to agree that it probably makes sense to default to 16-bit for exporters. If you'd like to propose a PR I'd be happy to review it; otherwise I'll get to it in a few days :)

Very interesting comment about argmax performance too. Do you think something like https://developer.apple.com/documentation/accelerate/bnns/reductionfunction/argmax could be faster in this particular case? Another option could be to write a custom Metal kernel, but that sounds similar to what appears to be happening under the hood.

laclouis5 commented 9 months ago

I just opened a PR that changes the default behavior (#58).

Regarding argmax, I did not try BNNS or a custom MPS kernel. It would be great to compare, but I assume Apple already uses its best implementation in Core ML anyway. I quickly tested the performance of argmax in PyTorch using MPS just to get an idea, and the operation seems slow there too: I got 300 ms for an argmax over the channel dimension of a tensor of shape 1x1000x512x512.
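The quick check I ran looked roughly like this (timings obviously vary per machine, and torch.mps.synchronize() is needed so the GPU work is actually counted; it requires a recent PyTorch version):

```python
import time
import torch

assert torch.backends.mps.is_available()

logits = torch.rand(1, 1000, 512, 512, device="mps")

# Warm up so one-time kernel compilation is not counted.
for _ in range(3):
    torch.argmax(logits, dim=1)
torch.mps.synchronize()

start = time.perf_counter()
torch.argmax(logits, dim=1)
torch.mps.synchronize()
print(f"argmax over the channel dim: {(time.perf_counter() - start) * 1000:.1f} ms")
```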

This may not be easily solvable. However, having 1000 classes for semantic segmentation is very rare I guess, so this may not be an issue in practice.

laclouis5 commented 8 months ago

In the PR implementing the changes mentioned above, I noted that in my experience I had never observed a drop in performance when using FP16 on vision models, but that I had not tried LLMs:

I personally never observed a significant drop in performance when exporting vision models to CoreML in Float16 (Yolov8, UNets, etc.). I wasn't able to run the tests because of dependency issues, but it would be great to know if a network suffers from poor performance in Float16, especially LLMs.

Recently, I needed to export models integrating an LLM component, such as Open-CLIP and UForm, to Core ML, but I faced accuracy issues when using FP16 conversion. More precisely, FP16 causes overflows in some layers, resulting in an accuracy close to 0.

I was able to mostly resolve this by keeping the overflowing operators in FP32 precision (see the details in this issue).
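For anyone hitting the same overflow problem, the workaround is roughly the following when converting with coremltools: keep the model in Float16 overall but exclude the problematic ops via the op_selector hook. This is a sketch only, assuming a traced_model as in the earlier conversion example; the op types listed are placeholders, since the actual overflowing ops depend on the model.

```python
import coremltools as ct

def to_fp16(op):
    # Return True for ops that can safely run in FP16; returning False keeps
    # the op in FP32. The op types below are placeholders for illustration.
    return op.op_type not in {"softmax", "reduce_sum"}

mlmodel = ct.convert(
    traced_model,  # a traced PyTorch model, as in the earlier conversion sketch
    inputs=[ct.TensorType(name="input", shape=(1, 3, 224, 224))],
    compute_precision=ct.transform.FP16ComputePrecision(op_selector=to_fp16),
)
```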

Were you able to run the tests of the exporters library to check whether this issue affects some models here too?