mlcommons / mobile_app_open

Mobile App Open
https://mlcommons.org/en/groups/inference-mobile/
Apache License 2.0

Continue updating the Apple backend #741

Open freedomtan opened 1 year ago

freedomtan commented 1 year ago

Some ideas we can try to improve the Apple backend.

RSMNYS commented 9 months ago

@freedomtan for further improvements, should we use the saved models from your repo (MobileBERT, MobileDet)? Or can we use models from TensorFlow Hub (at least for the MobileBERT model: https://tfhub.dev/tensorflow/mobilebert_en_uncased_L-24_H-128_B-512_A-4_F-4_OPT)? As far as I can see, the saved models are in TF 1 format. However, the inputs of the new model are different from ours.

freedomtan commented 9 months ago

I am not proud of my repo :-) For MobileBERT: whatever we do, it should be compatible with (and mathematically equivalent to) what our Google colleagues contributed at https://github.com/mlcommons/mobile_open/tree/main/language/bert. For MobileDet: see https://github.com/mlcommons/mobile_open/tree/main/vision/mobilenet

We should check the accuracy of the models.

As far as I can tell, https://tfhub.dev/tensorflow/mobilebert_en_uncased_L-24_H-128_B-512_A-4_F-4_OPT is not for SQuAD (hence not compatible).

RSMNYS commented 8 months ago

Hi guys! I've converted MobileBERT to the .mlpackage format using coremltools 7 and TensorFlow 2.12, and also optimized the model using quantization. Currently I have a problem using the .mlpackage format in our application. The problem arises during on-device compilation to produce the .mlmodelc. I've tried compiling on the Mac itself and then using the compiled model, but then there is an issue loading its content. I'm working to resolve this so we can see how accurate the optimized model is.

While working on the task I found the following issues/possible improvements:

  1. For on-device model compilation we use a deprecated method that compiles the model synchronously. There are alternatives that use async methods or a method with a callback, so it would be good to revise this. The problem here is that the app expects a configured CoreMLExecutor after init.
  2. Flutter has some issues when debugging on a device with iOS 17 (so we can't debug directly). Some patches exist, but the Flutter version recommended in the doc is not the official one and can't be updated. Maybe we can revise this as well and try to update Flutter to the latest version; at least it will be easier to maintain in the future.

freedomtan commented 8 months ago

@RSMNYS I don't really get what you ran into. From my past experience, if we can make //flutter/cpp/binary:main work on macOS, the app will mostly work on iOS.

And for performance, please first check whether you get a latency improvement in the Core ML Performance Report in Xcode / Instruments.

RSMNYS commented 8 months ago

The thing is, it works for main (when running tests), and it loads the ML program with no issues. But when trying it in the app, the error says it can't read the spec. I will continue with this today.

RSMNYS commented 8 months ago

Hi guys! Here are the inference results for the original MobileBERT (.mlmodel) and the newly converted models (.mlpackage and optimized .mlpackage). For the optimized one we used the default int8 quantized data type. As we can see, the converted .mlpackage has worse results than the original .mlmodel. We need to check what the problem could be. As for the quantized model, everything seems correct: we used a lower-precision data type (int8 rather than float16), which is why the results are worse.

An .mlpackage is a directory, not a single file, so to have a fingerprint for it we need to archive it. To handle the archive with the .mlpackage correctly, I've adjusted archive_cached_helper. Now the app can load the .mlpackage and run inference. There are still some difficulties with the model path after the app restarts, because the logic returns only the path to the archive's folder.
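
A minimal sketch of the idea above (archive the package directory, then fingerprint the single resulting file). It is only an illustration: the function names and the choice of SHA-256 are assumptions, not what the app's archive_cached_helper actually does.

```python
import hashlib
import zipfile
from pathlib import Path


def archive_package(package_dir: Path, archive_path: Path) -> Path:
    """Zip a directory-based model (e.g. an .mlpackage) into a single file."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # Sort entries so the archive layout does not depend on filesystem order.
        for path in sorted(package_dir.rglob("*")):
            if path.is_file():
                # Keep the package folder itself as the top-level entry.
                arcname = path.relative_to(package_dir.parent).as_posix()
                zf.write(path, arcname)
    return archive_path


def file_fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a single file (hash choice is an assumption)."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large model archives are not loaded at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The fingerprint is taken over the published archive file. Zip entries carry timestamps, so re-archiving the same directory later will generally give a different digest.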

In our case we have: https://github.com/RSMNYS/mobile_models/raw/main/v3_0/CoreML/MobileBERT.zip. After download and unarchiving, the model is saved to ../raw/main/v3_0/CoreML/MobileBERT/MobileBERT.mlpackage. After an app restart (the app uses cached resources), the app returns this model_path: ../raw/main/v3_0/CoreML/MobileBERT, which is not correct. I think we can resolve this by introducing a new property in the pbtxt settings, model_name, so we can compose the model_path correctly and support model types that are not a single file but a package (directory). Let's discuss.
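
The model_name proposal could look roughly like the sketch below. It is written in Python only for illustration (the app itself is Flutter/C++), and resolve_model_path is a hypothetical helper, not an existing function in the repo.

```python
from pathlib import Path
from typing import Optional


def resolve_model_path(extracted_dir: Path, model_name: Optional[str] = None) -> Path:
    """Compose the path the backend should load.

    extracted_dir: folder the archive was unpacked into (what the cache returns today).
    model_name: proposed pbtxt setting naming the file or package inside that folder.
    """
    if model_name is None:
        # No model_name given: keep today's behavior and return the cached path.
        return extracted_dir
    # Directory-based models (.mlpackage): append the package name.
    return extracted_dir / model_name


cached = Path("../raw/main/v3_0/CoreML/MobileBERT")
print(resolve_model_path(cached, "MobileBERT.mlpackage"))
```

With model_name left unset the helper returns the cached path unchanged, so existing single-file model settings would not need to change.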

(Image: optimization_results)

freedomtan commented 8 months ago

@RSMNYS Please check model performance with the Xcode Performance tab and/or Core ML Instruments first. For a performance benchmark, it's hard to ask people to believe that we have an "improved" model which is 1 - (92.14/121.71) = 24% slower than the original one.
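
The 24% figure can be reproduced with a few lines. This sketch assumes both numbers are scores where higher is better (the thread does not state the unit), and relative_drop is just an illustrative helper name.

```python
def relative_drop(new_score: float, baseline_score: float) -> float:
    """Fraction by which new_score falls short of baseline_score."""
    return 1.0 - new_score / baseline_score


# 92.14 for the converted model vs. 121.71 for the original one.
drop = relative_drop(92.14, 121.71)
print(f"{drop:.1%}")  # prints 24.3%
```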

anhappdev commented 8 months ago

@RSMNYS Can you try renaming ‘MobileBERT.zip’ to ‘MobileBERT.mlpackage.zip’?

freedomtan commented 8 months ago

@RSMNYS please share your forked repo or the .mlpackage model.

RSMNYS commented 8 months ago

@freedomtan here is the forked repo: https://github.com/RSMNYS/mobile_models/raw/main/v3_0/CoreML/MobileBERT.mlpackage.zip

freedomtan commented 8 months ago

Let's check something basic.

  1. Did you try opening the model you converted in Xcode and running it in the Performance tab (Xcode Performance tab and/or Core ML Instruments)? When I tried to run your .mlpackage model with "CPU and ANE", I got error messages. Most likely, something is wrong. And if you compare my .mlmodel with your .mlpackage, you can see that the .mlmodel is faster because it runs on CPU+ANE while your .mlpackage runs on GPU+ANE.
  2. I converted the .mlmodel I had at https://github.com/freedomtan/coreml_models_for_mlperf/tree/main/mobilebert to .mlpackage by opening it in Xcode 15.0, clicking the "Edit" button, and then accepting the conversion. I got a model in .mlpackage format, and running it with CPU and ANE is roughly as good as running the .mlmodel one.
  3. If you want to debug the converted model, maybe you can start from its MIL: https://apple.github.io/coremltools/docs-guides/source/model-intermediate-language.html

RSMNYS commented 8 months ago

@freedomtan here is my test with the converted model:

(Screenshots: 2023-11-16 at 11 28 04, 2023-11-16 at 11 31 13)

Sometimes Xcode fails to create the report, but I believe this is an Xcode issue: it sometimes shows me the wrong operating system, although everything is listed correctly in the result.

Can you share the General tab for the converted model, the results, and the Xcode version, please?

freedomtan commented 8 months ago

@RSMNYS I meant "CPU and ANE". "GPU and ANE" is the reason why your model is slower.

freedomtan commented 7 months ago

@freedomtan to post profiling results showing how the old Core ML model works on a couple of devices.

freedomtan commented 7 months ago

@RSMNYS with the MobileBERT.mlmodel here, https://github.com/freedomtan/coreml_models_for_mlperf/tree/main/mobilebert, comparing my .mlmodel and your .mlpackage in Instruments, you can see, as I said, that the GPU takes much longer than the CPU.

With coremltools's converter, you can try to convert a TF model to MIL by setting the convert_to parameter to milinternal https://apple.github.io/coremltools/source/coremltools.converters.convert.html.
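
A rough sketch of that conversion, assuming a TensorFlow SavedModel directory and an environment with coremltools and TensorFlow installed. The helper names are hypothetical, and a model with dynamic input shapes may also need an explicit inputs argument to ct.convert.

```python
from collections import Counter


def count_op_types(op_types):
    """Count how often each op type appears, most frequent first."""
    return Counter(op_types).most_common()


def mil_op_histogram(saved_model_dir: str):
    """Convert a TF SavedModel to a MIL program and summarize its top-level ops."""
    # Imported lazily so the helper above works without coremltools installed.
    import coremltools as ct

    # convert_to="milinternal" returns the MIL Program instead of an MLModel.
    prog = ct.convert(saved_model_dir, source="tensorflow", convert_to="milinternal")
    print(prog)  # textual dump of the MIL program
    main = prog.functions["main"]
    # Only ops directly in the main function are counted, not ops in nested blocks.
    return count_op_types(op.op_type for op in main.operations)
```

Comparing such op histograms for two conversions is one way to narrow down which ops differ before looking at them in detail.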

(Screenshots: freedom's .mlmodel; Sergie's .mlpackage)

freedomtan commented 7 months ago

@RSMNYS

I dug into it a bit over the past weekend. Some information may be useful.

And then, it should be possible to tweak the .mil program.

RSMNYS commented 6 months ago

@freedomtan I did some more testing with MobileBERT.mlpackage. I've set different precisions for the model: Float16, and Float32 and here are the results:

FLOAT16

All units: 8.33 ms (1900 operations on NE, 8 ops on GPU)
CPU only: can't run
CPU & GPU: 31.45 ms (all operations run on GPU)
CPU & NE: can't run

FLOAT32

All units: 31.11 ms (all operations run on GPU only)
CPU only: 69.25 ms
CPU & GPU: 30.7 ms (all operations run on GPU only)
CPU & NE: 69.13 ms (all operations run on CPU only)
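
For reference, producing the two precision variants with coremltools might look like the sketch below. The exact conversion script used for these measurements is not shown in the thread, so the function names, the output naming scheme, and the SavedModel input are all assumptions.

```python
import os


def variant_filename(base_name: str, precision_label: str) -> str:
    """Build an output name such as MobileBERT_float16.mlpackage (naming is an assumption)."""
    return f"{base_name}_{precision_label.lower()}.mlpackage"


def convert_with_precisions(saved_model_dir: str, out_dir: str, base_name: str = "MobileBERT"):
    """Convert one TF SavedModel to ML programs with FLOAT16 and FLOAT32 compute precision."""
    # Imported lazily; requires coremltools and TensorFlow to be installed.
    import coremltools as ct

    precisions = {"FLOAT16": ct.precision.FLOAT16, "FLOAT32": ct.precision.FLOAT32}
    paths = []
    for label, precision in precisions.items():
        mlmodel = ct.convert(
            saved_model_dir,
            source="tensorflow",
            convert_to="mlprogram",       # ML program, saved as .mlpackage
            compute_precision=precision,  # precision used for the model's computation
        )
        path = os.path.join(out_dir, variant_filename(base_name, label))
        mlmodel.save(path)
        paths.append(path)
    return paths
```

Each saved package can then be opened in Xcode's Performance tab and run with the different compute-unit settings listed above.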

I also found this description: "ML programs use a GPU runtime that is backed by the Metal Performance Shaders Graph framework." So could it be that the .mlpackage is optimized to run operations on the GPU (to use parallel execution)? And since NLP models are sequential in nature, running on the GPU is not so beneficial (in terms of QPS). We can check other (vision) models to see if the operations are faster in that case. Checking more.

freedomtan commented 6 months ago

@RSMNYS The float16 and float32 results don't surprise me at all. As far as I know,

CPU: fp16, fp32 (and maybe bf16)
GPU: fp16, bf16, and fp32
ANE: fp16

I recommend

  1. read https://machinelearning.apple.com/research/neural-engine-transformers,
  2. watch https://developer.apple.com/videos/play/wwdc2023/10047/, and
  3. read https://apple.github.io/coremltools/docs-guides/source/performance-impact.html

For the .mlmodel and .mlpackage we discussed:

(Screenshots: mlmodel, mlpackage)

As you can see, running on GPU is slower.

With the MIL program and Netron, we can find out what the 10 and 8 ops in the .mlmodel and .mlpackage, respectively, are.

Maybe we can change the ML program manually to check what stops the 8 ops from running on CPU.