Continue updating the Apple backend

freedomtan commented 1 year ago

Some ideas we can try to improve the Apple backend:

[ ] In the WWDC 2023 session "Use Core ML Tools for machine learning model compression", https://developer.apple.com/wwdc23/10047, Apple folks claimed that Apple's new quantization scheme could help reduce inference latency
- Apple used that to convert Stable Diffusion models, see https://github.com/apple/ml-stable-diffusion
[ ] some models definitely still have room for improvement, e.g.,
- the MobileDet was converted with freedom's quick-and-dirty script, and
- the MobileBERT could be improved by referring to Apple's Transformer on Neural Engine guide, https://machinelearning.apple.com/research/neural-engine-transformers

RSMNYS commented 9 months ago

@freedomtan For further improvements, should we use the saved models from your repo (MobileBERT, MobileDet)? Or can we use models from TensorFlow Hub (at least for MobileBERT: https://tfhub.dev/tensorflow/mobilebert_en_uncased_L-24_H-128_B-512_A-4_F-4_OPT)? As far as I can see, the saved models are in the TF 1 format. However, the inputs of the new model are different from ours.

freedomtan commented 9 months ago

I am not proud of my repo :-) For MobileBERT: whatever we do, it should be compatible with (and mathematically equivalent to) what Google colleagues contributed at https://github.com/mlcommons/mobile_open/tree/main/language/bert. For MobileDet: see https://github.com/mlcommons/mobile_open/tree/main/vision/mobilenet

We should check the accuracy of the models.

As far as I can tell, the model at https://tfhub.dev/tensorflow/mobilebert_en_uncased_L-24_H-128_B-512_A-4_F-4_OPT is not for SQuAD (hence not compatible).

RSMNYS commented 8 months ago

Hi guys! I've converted MobileBERT to the .mlpackage format using coremltools 7 and TensorFlow 2.12, and also optimized the model using quantization. Currently I have a problem using the .mlpackage format in our application. The problem arises during on-device compilation to produce the .mlmodelc. I've tried compiling on the Mac itself and then using the compiled model, but then there is an issue with loading its content. I'm working on resolving this so we can see how accurate the optimized model is.

While working on the task I found the following issues/possible improvements:

- For on-device model compilation we use a deprecated method that compiles the model synchronously. There are alternatives that use async methods or a method with a callback, so it would be good to revise this. The problem here is that the app expects a configured CoreMLExecutor after init.
- Flutter has some issues when debugging on a device with iOS 17 (so we can't debug directly). Some patches exist, but the Flutter version recommended in the doc is not the official one and can't be updated. Maybe we can revise this as well and try to update Flutter to the latest version; at least it will be easier to maintain in the future.

freedomtan commented 8 months ago

@RSMNYS I don't really get what you ran into. In my experience, if we can make //flutter/cpp/binary:main work on macOS, the app will mostly work on iOS.

And for performance, please first check whether you get a latency improvement in Xcode's / Instruments' Core ML Performance Report.

RSMNYS commented 8 months ago

The thing is, it works for main (when running tests), and it loads the ML program with no issues. But when trying it in the app, the error says it can't read the spec. I will continue with this today.

RSMNYS commented 8 months ago

Hi guys! Here are the inference results for the original MobileBERT (.mlmodel) and the newly converted models (.mlpackage and optimized .mlpackage). For the optimized one we used the default int8 quantized data type. As we can see, the converted .mlpackage has worse results than the original .mlmodel; we need to check what the problem could be. The quantized model's results seem expected: we used a lower-precision data type (int8 rather than float16), which is why the results are worse.

An .mlpackage is a directory, not a single file, so to have a fingerprint for it we need to archive it. To correctly handle an archive containing the .mlpackage I've adjusted the archive_cached_helper, so the app can now load the .mlpackage and run inference. I still have some difficulties with the model path after the app restarts, because the logic returns only the path to the archive's folder.
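To illustrate the "archive it, then fingerprint it" idea, here is a minimal Python sketch. It is not the app's archive_cached_helper: the helper names `archive_package` and `fingerprint` are hypothetical, and SHA-256 is an assumed hash, since the thread does not say which fingerprint the app uses.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def archive_package(package_dir: Path, out_dir: Path) -> Path:
    """Zip a model package directory so it can be handled as a single file."""
    # base_name carries no ".zip"; make_archive appends the extension itself.
    base_name = out_dir / package_dir.name
    archive = shutil.make_archive(
        str(base_name),
        "zip",
        root_dir=package_dir.parent,  # keep the package folder as the zip's top level
        base_dir=package_dir.name,
    )
    return Path(archive)


def fingerprint(path: Path) -> str:
    """Hex SHA-256 digest of a file (the hash choice is an assumption)."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a real package: a directory holding one dummy file.
    package = Path(tmp) / "MobileBERT.mlpackage"
    (package / "Data").mkdir(parents=True)
    (package / "Data" / "weights.bin").write_bytes(b"\x00" * 16)

    out_dir = Path(tmp) / "out"
    out_dir.mkdir()
    archive = archive_package(package, out_dir)
    archive_name = archive.name
    archive_digest = fingerprint(archive)
```

Zip files embed timestamps, so re-archiving the same directory can produce a different digest. The fingerprint should therefore be taken from the one published archive rather than recomputed from a freshly zipped copy.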

In our case we have: https://github.com/RSMNYS/mobile_models/raw/main/v3_0/CoreML/MobileBERT.zip. After download and unarchiving, the model is saved to ../raw/main/v3_0/CoreML/MobileBERT/MobileBERT.mlpackage. After an app restart (the app uses cached resources) the app returns this model_path: ../raw/main/v3_0/CoreML/MobileBERT, which is not correct. I think we can resolve this by introducing a new property in the pbtxt settings, model_name, so we can compose the model_path correctly and support model types that are not a single file but a package (directory). Let's discuss.
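A rough sketch of how the proposed model_name property could be used to compose the path. The function `resolve_model_path` is hypothetical and only mirrors the proposal; it is not existing app code, and the real logic would live in the app rather than in Python.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_model_path(cached_path: PurePosixPath,
                       model_name: Optional[str] = None) -> PurePosixPath:
    """Compose the final model path from the cached location.

    For single-file models the cached path already points at the model, so
    model_name is omitted. For package (directory) models the cache only knows
    the unarchived folder, and model_name names the package inside it.
    """
    if model_name:
        return cached_path / model_name
    return cached_path


# Values taken from the example in this comment.
cached_dir = PurePosixPath("../raw/main/v3_0/CoreML/MobileBERT")
resolved = resolve_model_path(cached_dir, "MobileBERT.mlpackage")
unchanged = resolve_model_path(cached_dir)
```

Making model_name optional keeps existing single-file entries working without any change to their settings.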

[Image: optimization_results]

freedomtan commented 8 months ago

@RSMNYS Please check model performance with the Xcode Performance tab and/or Core ML Instruments first. For a performance benchmark, it's hard to ask people to believe that we have an "improved" model which is 1 - (92.14/121.71) = 24% slower than the original one.
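The 24% figure can be reproduced as follows. The two numbers come from the results image, which is not preserved here; treating them as throughput (higher is better) is an assumption, and the variable names are illustrative.

```python
# Figures quoted in the comment above (assumed to be throughput).
original = 121.71   # original .mlmodel
converted = 92.14   # converted .mlpackage

# Fraction of the original performance that was lost.
relative_drop = 1 - converted / original
percent_drop = round(relative_drop * 100)
```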

anhappdev commented 8 months ago

@RSMNYS Can you try renaming ‘MobileBERT.zip’ to ‘MobileBERT.mlpackage.zip’?

freedomtan commented 8 months ago

@RSMNYS Please share your forked repo or the .mlpackage model.

RSMNYS commented 8 months ago

@freedomtan here is the forked repo: https://github.com/RSMNYS/mobile_models/raw/main/v3_0/CoreML/MobileBERT.mlpackage.zip

freedomtan commented 8 months ago

Let's check something basic.

- Did you try to open the model you converted in Xcode and run it in the Performance tab, as suggested earlier (Xcode Performance tab and/or Core ML Instruments)? When I tried to run your .mlpackage model with "CPU and ANE", I got error messages. Most likely something is wrong. And if you compare running my .mlmodel and your .mlpackage, you can see that the .mlmodel is faster because it runs on CPU+ANE while your .mlpackage runs on GPU+ANE.
- I converted the .mlmodel I had at https://github.com/freedomtan/coreml_models_for_mlperf/tree/main/mobilebert to .mlpackage by opening it in Xcode 15.0, clicking the "Edit" button, and then accepting the conversion. I got a model in .mlpackage format, and running it with CPU and ANE is roughly as good as running the .mlmodel one.
- If you want to debug the converted model, maybe you can start from its MIL: https://apple.github.io/coremltools/docs-guides/source/model-intermediate-language.html

RSMNYS commented 8 months ago

@freedomtan here is my test with the converted model:

Sometimes Xcode fails to create the report, but I believe this is an Xcode issue: it sometimes shows me the wrong operating system, although in the results everything is listed correctly.

Can you please share the General tab for the converted model, the results, and your Xcode version?

freedomtan commented 8 months ago

@RSMNYS I meant "CPU and ANE". "GPU and ANE" is the reason why your model is slower.

freedomtan commented 7 months ago

@freedomtan to post profiling results showing how the old Core ML model works on a couple of devices.

freedomtan commented 7 months ago

@RSMNYS With the MobileBERT.mlmodel here, https://github.com/freedomtan/coreml_models_for_mlperf/tree/main/mobilebert, comparing my .mlmodel and your .mlpackage in Instruments, you can see, as I said, that the GPU takes much longer than the CPU.

With coremltools's converter, you can try to convert a TF model to MIL by setting the convert_to parameter to milinternal https://apple.github.io/coremltools/source/coremltools.converters.convert.html.

[Images: freedom's .mlmodel | Sergie's .mlpackage]

freedomtan commented 7 months ago

@RSMNYS

I dug a bit into it over the past weekend. Some information that may be useful:

We can check the graphs of both the .mlmodel and the .mlpackage with netron.
- for .mlmodel: simply netron MobileBERT.mlmodel works
- for .mlpackage: there is a model.mlmodel in MobileBERT.mlpackage/Data/com.apple.CoreML/
We can get MIL programs matching the graphs.
- for .mlmodel: use convert_to = 'milinternal' as mentioned above
- for .mlpackage: xcrun mlmodelc compile MobileBERT.mlpackage /tmp, then we can find model.mil in /tmp/MobileBERT.mlmodelc/
With the graphs and MIL programs, it's possible to check the first 10 ops and 8 ops of the .mlmodel and .mlpackage, respectively.
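As a small convenience for the netron step, the location of the embedded spec can be resolved with a helper like the sketch below. The helper names are hypothetical, and the layout is the one described in this comment; other packages may be laid out differently.

```python
import tempfile
from pathlib import Path


def spec_path_in_package(package: Path) -> Path:
    """Path where this thread says the model spec lives inside an .mlpackage."""
    return package / "Data" / "com.apple.CoreML" / "model.mlmodel"


def find_spec(package: Path) -> Path:
    """Return the embedded model.mlmodel, or raise if the layout differs."""
    spec = spec_path_in_package(package)
    if not spec.is_file():
        raise FileNotFoundError(f"no model.mlmodel at {spec}")
    return spec


with tempfile.TemporaryDirectory() as tmp:
    # Fake package with the expected layout, only to exercise the helper.
    package = Path(tmp) / "MobileBERT.mlpackage"
    inner = package / "Data" / "com.apple.CoreML"
    inner.mkdir(parents=True)
    (inner / "model.mlmodel").write_bytes(b"")
    relative_spec = find_spec(package).relative_to(package).as_posix()
```

The returned path can then be passed to netron in the same way as a standalone .mlmodel file.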

And then, it should be possible to tweak the .mil program.

RSMNYS commented 6 months ago

@freedomtan I did some more testing with MobileBERT.mlpackage. I've set different precisions for the model, Float16 and Float32, and here are the results:

FLOAT16

- All units: 8.33 ms; 1900 operations on NE, 8 ops on GPU
- CPU only: can't run
- CPU & GPU: 31.45 ms; all operations run on GPU
- CPU & NE: can't run

FLOAT32

- All units: 31.11 ms; all operations run on GPU only
- CPU only: 69.25 ms
- CPU & GPU: 30.7 ms; all operations run on GPU only
- CPU & NE: 69.13 ms; all operations run on CPU only

I also found this description: "ML programs use a GPU runtime that is backed by the Metal Performance Shaders Graph framework." So could it be that the .mlpackage is optimized to perform operations on the GPU (to utilize parallel execution)? Since NLP models are sequential in nature, running on the GPU is not so beneficial (in terms of QPS). We can check other models (vision) to see if the operations are faster in that case. Checking more.

freedomtan commented 6 months ago

@RSMNYS The float16 and float32 results don't surprise me at all. As far as I know, the supported precisions are:

- CPU: fp16, fp32 (and maybe bf16)
- GPU: fp16, bf16, and fp32
- ANE: fp16
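The list above can be read as a lookup table. The sketch below encodes it as stated in this comment ("as far as I know"), not as an authoritative Apple specification; the names `SUPPORTED_PRECISIONS` and `units_supporting` are illustrative.

```python
# Precision support per compute unit, exactly as listed in the comment above.
# bf16 on CPU is marked "maybe" there, so it is left out of the CPU set here.
SUPPORTED_PRECISIONS = {
    "CPU": {"fp16", "fp32"},
    "GPU": {"fp16", "bf16", "fp32"},
    "ANE": {"fp16"},
}


def units_supporting(precision: str) -> list:
    """Compute units that, per the table above, can run the given precision."""
    return sorted(
        unit for unit, precisions in SUPPORTED_PRECISIONS.items()
        if precision in precisions
    )


fp32_units = units_supporting("fp32")
fp16_units = units_supporting("fp16")
```

Under this table an fp32 model has no path to the ANE, which is consistent with the FLOAT32 measurements where "CPU & NE" ran entirely on the CPU. The table alone does not explain why the FLOAT16 model could not run in the CPU configurations.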

I recommend

For the .mlmodel and .mlpackage we discussed: [Images: mlmodel | mlpackage]

As you can see, running on GPU is slower.

With the MIL programs and netron, we can find what the 10 and 8 ops in the mlmodel and mlpackage are, respectively:

mlmodel:
- 0: buffer conversion
- 1-3: the 2 nodes after 'segment_ids'
- 4-5: the 2 nodes after 'input_mask'
- 6-7: the 3 nodes after 'input_ids'
- 8-9: loadConst and the add after 3
mlpackage:
- 0-2: buffer conversion
- 3: the node after 'input_ids'
- 4: the gather after 'segment_ids' → cast, cast folded
- 5, 6 : the cast and reshape after 'input_mask'
- 7: the gather after 3, the expand_dims, cast folded

Maybe we can change mlprogram manually to check what stopped the 8 ops from running on CPU.

mlcommons / mobile_app_open

Continue updating the Apple backend #741