Thanks a lot @laclouis5! Very interesting observations!
The philosophy we have been applying is to preserve the weights in their original precision when possible, but I tend to agree that it probably makes sense to default to 16-bit for `exporters`. If you'd like to propose a PR I'd be happy to review it; otherwise I'll get to it in a few days :)
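As a rough illustration of what that default would mean at the `coremltools` level, here is a minimal conversion sketch with Float16 compute precision; the model, input name, and shape below are placeholders, not anything taken from `exporters` itself:

```python
import coremltools as ct
import torch

# Placeholder model: any torch.nn.Module traced with example inputs.
traced = torch.jit.trace(torch.nn.Conv2d(3, 8, 3), torch.rand(1, 3, 512, 512))

# Explicitly request Float16 compute precision (the proposed default)
# instead of ct.precision.FLOAT32.
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="pixel_values", shape=(1, 3, 512, 512))],
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT16,
)
mlmodel.save("model.mlpackage")
```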
Very interesting comment about `argmax` performance too. Do you think something like https://developer.apple.com/documentation/accelerate/bnns/reductionfunction/argmax could be faster in this particular case? Another option could be to write a custom Metal kernel, but that sounds similar to what appears to be happening under the hood.
I just opened a PR that changes the default behavior (#58).
For the `argmax` thing, I did not try the BNNS function or a custom MPS kernel. It would be great to compare, but I assume Apple already uses its best implementation in CoreML anyway. I quickly tested the performance of `argmax` in PyTorch using MPS just to get an idea, and the operation seems slow there too: I got 300 ms for an `argmax` on the channel dim for a tensor of shape 1x1000x512x512.
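A minimal sketch of that quick MPS check, assuming a recent PyTorch build with MPS support (the shape matches the one above; timings will of course vary by machine):

```python
import time
import torch

# Logits with the same shape as the Segformer output discussed above.
x = torch.rand(1, 1000, 512, 512, device="mps")

# Warm-up so kernel compilation is not included in the measurement.
torch.argmax(x, dim=1)
torch.mps.synchronize()

start = time.perf_counter()
labels = torch.argmax(x, dim=1)  # argmax over the channel (class) dimension
torch.mps.synchronize()
print(f"argmax took {(time.perf_counter() - start) * 1000:.1f} ms")
```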
This may not be easily solvable. However, having 1000 classes for semantic segmentation is very rare, I guess, so this may not be an issue in practice.
In the PR implementing the changes mentioned above, I noted that in my experience I never observed a drop in performance when using FP16 on vision models, but that I did not try LLMs:

> I personally never observed a significant drop in performance when exporting vision models to CoreML in Float16 (Yolov8, UNets, etc.). I wasn't able to run the tests because of dependency issues, but it would be great to know if a network suffers from poor performance in Float16, especially LLMs.
Recently, I needed to export to CoreML models integrating an LLM component, such as Open-CLIP and UForm, but I faced accuracy issues when using FP16 conversion. More precisely, the conversion causes overflows in some layers, resulting in an accuracy close to 0.
I was able to work around this by keeping the overflowing operators in FP32 precision, which mostly solved the issue (see the details in this issue).
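A sketch of that kind of selective-precision conversion with `coremltools`; the tiny model and the op types excluded from FP16 below are placeholders, and the real overflowing ops have to be identified case by case:

```python
import coremltools as ct
import numpy as np
import torch

# Placeholder text model standing in for the LLM component (e.g. a CLIP text
# encoder); replace with the real traced module.
class TinyTextModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(49408, 512)
        self.proj = torch.nn.Linear(512, 512)

    def forward(self, input_ids):
        return self.proj(self.emb(input_ids)).softmax(dim=-1)

traced = torch.jit.trace(TinyTextModel(), torch.zeros(1, 77, dtype=torch.long))

# Keep numerically sensitive ops in FP32 while converting everything else to
# FP16. The op_type filter is illustrative only.
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids", shape=(1, 77), dtype=np.int32)],
    convert_to="mlprogram",
    compute_precision=ct.transform.FP16ComputePrecision(
        op_selector=lambda op: op.op_type not in ("softmax", "reduce_sum")
    ),
)
```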
Were you able to run the tests of the `exporters` library to know whether this issue affects some models here too?
I was trying to export Segformer models to CoreML but the exported model is slow compared to the same model exported on my own.
I tried to export the model using the following command:
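(A representative `exporters` invocation; the Segformer model id and feature below are assumptions, not the exact command that was used.)

```bash
# Model id and feature are placeholders.
python -m exporters.coreml \
    --model=nvidia/segformer-b0-finetuned-ade-512-512 \
    --feature=semantic-segmentation \
    exported/
```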
This model's median prediction time is 500 ms on my MacBook Pro M1 using all the available accelerators (ANE, GPU, CPU), above the 300 ms of the same model exported on my own using `coremltools` directly.

I did a bit of profiling to identify the issue using Xcode Instruments. It looks like the model is exported and executed in Float32. This greatly undermines the performance, since Float16 data is required for the ANE to be used. Thus, the ANE is not used at all and the model is executed on the GPU only on most devices. Also, Float32 computations are slower than Float16 computations on the GPU, so Float32 should be avoided when possible. In the `coremltools` documentation Apple suggests using Float16 as a default, and as of version 7.0 Float16 is the default precision for CoreML exports.

With the option `--quantize=float16`, the inference time is on par with the model I exported (around 300 ms). I suggest using the `coremltools` default Float16 precision instead of Float32 in order to get the most out of the specialized hardware on Apple platforms.

I also noted another issue, not related to the `exporters` framework. In Float16 and with the ANE, the Instruments trace suggests that half of the prediction time is spent in GPU kernels. That is weird since only one operator is executed on the GPU in this case: the `argmax` operation at the end of the model. This slowdown needs further investigation, but it may be due to the large size of the input tensor (1000x512x512). I tried with only 16 output classes and the inference time dropped to 60 ms.
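For completeness, prediction times like the ones quoted above can also be measured from Python with `coremltools` directly; the package path and input name below are placeholders, and the exported model may expect a differently named or image-typed input:

```python
import time
import numpy as np
import coremltools as ct

# Placeholders: adjust the package path and input name to the exported model.
model = ct.models.MLModel("exported/Model.mlpackage",
                          compute_units=ct.ComputeUnit.ALL)

pixels = np.random.rand(1, 3, 512, 512).astype(np.float32)

# Warm-up run, then take the median over several predictions.
model.predict({"pixel_values": pixels})
times = []
for _ in range(20):
    start = time.perf_counter()
    model.predict({"pixel_values": pixels})
    times.append(time.perf_counter() - start)
print(f"median prediction time: {np.median(times) * 1000:.0f} ms")
```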