microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Mobile] How to run model inference with ARM GPU on android device? #18224

Closed zjc664656505 closed 8 months ago

zjc664656505 commented 10 months ago

Describe the issue

We are currently developing a system that involves deploying Large Language Models (LLMs) on Android smartphones. To date, we've managed to execute inference tasks using ONNX Runtime with the CPU Execution Provider, but the process is regrettably slow. Our goal is to leverage the built-in hardware accelerators, such as the GPU, to expedite the inference process. The specific GPU integrated into our Android devices is the ARM Mali-G710 MP7.

In an attempt to utilize the device's GPU, we've experimented with ONNX Runtime in conjunction with NNAPI. Unfortunately, NNAPI appears to default to the Edge TPU (NPU) for inference, which is not currently supported by ONNX Runtime. We've also checked the compatibility of our model with ORT Mobile, NNAPI, and CoreML using the onnxruntime.tools.check_onnx_model_mobile_usability tool.
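For context, we register the NNAPI EP through the C/C++ API along these lines (a minimal sketch rather than our exact code; flags and error handling are simplified):

    #include <onnxruntime_cxx_api.h>
    #include <nnapi_provider_factory.h>  // OrtSessionOptionsAppendExecutionProvider_Nnapi

    // Minimal sketch: session options with the NNAPI EP registered; any nodes
    // NNAPI cannot handle fall back to the CPU EP automatically.
    Ort::SessionOptions MakeNnapiSessionOptions() {
      Ort::SessionOptions so;
      uint32_t nnapi_flags = 0;  // flags such as NNAPI_FLAG_USE_FP16 are optional
      Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_Nnapi(so, nnapi_flags));
      return so;
    }

    // Usage (model path illustrative):
    //   Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "lingualinked");
    //   Ort::Session session(env, "module_0_quant.onnx", MakeNnapiSessionOptions());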

Below is the log corresponding to our attempts:

python -m onnxruntime.tools.check_onnx_model_mobile_usability /home/junchen/LinguaLinked/onnx_model/backup/bloom3b_quantized_int8_res/module0/module_0_quant.onnx --log_level debug
INFO:  Checking /home/junchen/LinguaLinked/onnx_model/backup/bloom3b_quantized_int8_res/module0/module_0_quant.onnx for usability with ORT Mobile.
INFO:  Checking NNAPI
INFO:  64 partitions with a total of 504/758 nodes can be handled by the NNAPI EP.
INFO:  Partition sizes: [9, 20, 6, 1, 3, 8, 4, 9, 13, 14, 5, 8, 5, 1, 3, 8, 14, 13, 14, 5, 8, 5, 1, 3, 8, 14, 13, 14, 5, 8, 5, 1, 3, 8, 14, 13, 14, 5, 8, 5, 1, 3, 8, 14, 13, 14, 5, 8, 5, 1, 3, 8, 14, 13, 14, 5, 8, 5, 1, 3, 8, 14, 13, 5]
INFO:  Unsupported nodes due to operator=109
INFO:  Unsupported nodes due to input having a dynamic shape=145
INFO:  Unsupported ops: ai.onnx:ConstantOfShape,ai.onnx:CumSum,ai.onnx:DynamicQuantizeLinear,ai.onnx:Equal,ai.onnx:Expand,ai.onnx:Less,ai.onnx:MatMulInteger,ai.onnx:Not,ai.onnx:Or,ai.onnx:Range,ai.onnx:ScatterND,ai.onnx:Shape,ai.onnx:Where
DEBUG:  Caveats that have not been checked and may result in a node not being supported:  
     ai.onnx:DequantizeLinear:All quantization scales and zero points should be constant.
     ai.onnx:Gather:Input indices should be constant if not int32 type.
     ai.onnx:Unsqueeze:Input axes should be constant.
INFO:  NNAPI is not recommended with this model as there are 64 partitions covering 66.5% of the nodes in the model. This will most likely result in worse performance than just using the CPU EP.
INFO:  Model should perform well with NNAPI as is: NO
INFO:  Checking if model will perform better if the dynamic shapes are fixed...
INFO:  Partition information if the model was updated to make the shapes fixed:
INFO:  58 partitions with a total of 649/758 nodes can be handled by the NNAPI EP.
INFO:  Partition sizes: [29, 34, 16, 10, 2, 3, 10, 14, 13, 14, 11, 14, 5, 3, 10, 14, 13, 14, 11, 14, 5, 3, 10, 14, 13, 14, 11, 14, 5, 3, 10, 14, 13, 14, 11, 14, 5, 3, 10, 14, 13, 14, 11, 14, 5, 3, 10, 14, 13, 14, 11, 14, 5, 3, 10, 14, 13, 14]
INFO:  Unsupported nodes due to operator=109
INFO:  Unsupported ops: ai.onnx:ConstantOfShape,ai.onnx:CumSum,ai.onnx:DynamicQuantizeLinear,ai.onnx:Equal,ai.onnx:Expand,ai.onnx:Less,ai.onnx:MatMulInteger,ai.onnx:Not,ai.onnx:Or,ai.onnx:Range,ai.onnx:ScatterND,ai.onnx:Shape,ai.onnx:Where
DEBUG:  Caveats that have not been checked and may result in a node not being supported:  
     ai.onnx:DequantizeLinear:All quantization scales and zero points should be constant.
     ai.onnx:Gather:Input indices should be constant if not int32 type.
     ai.onnx:Unsqueeze:Input axes should be constant.
INFO:  NNAPI is not recommended with this model as there are 58 partitions covering 85.6% of the nodes in the model. This will most likely result in worse performance than just using the CPU EP.
INFO:  Model should perform well with NNAPI if modified to have fixed input shapes: NO
INFO:  Checking CoreML
INFO:  104 partitions with a total of 438/758 nodes can be handled by the CoreML EP.
INFO:  Partition sizes: [3, 7, 8, 6, 3, 1, 3, 1, 5, 4, 2, 5, 13, 5, 2, 5, 5, 3, 2, 3, 3, 3, 1, 5, 5, 2, 5, 13, 5, 2, 5, 5, 3, 2, 3, 3, 3, 1, 5, 5, 2, 5, 13, 5, 2, 5, 5, 3, 2, 3, 3, 3, 1, 5, 5, 2, 5, 13, 5, 2, 5, 5, 3, 2, 3, 3, 3, 1, 5, 5, 2, 5, 13, 5, 2, 5, 5, 3, 2, 3, 3, 3, 1, 5, 5, 2, 5, 13, 5, 2, 5, 5, 3, 2, 3, 3, 3, 1, 5, 5, 2, 5, 13, 5]
INFO:  Unsupported nodes due to operator=181
INFO:  Unsupported nodes due to input having a dynamic shape=139
INFO:  Unsupported ops: ai.onnx:ConstantOfShape,ai.onnx:CumSum,ai.onnx:DequantizeLinear,ai.onnx:DynamicQuantizeLinear,ai.onnx:Equal,ai.onnx:Expand,ai.onnx:Less,ai.onnx:MatMulInteger,ai.onnx:Not,ai.onnx:Or,ai.onnx:Range,ai.onnx:ReduceMean,ai.onnx:ScatterND,ai.onnx:Softmax,ai.onnx:Unsqueeze,ai.onnx:Where
DEBUG:  Caveats that have not been checked and may result in a node not being supported:  
     ai.onnx:Gather:Input `indices` with scalar value is not supported.
     ai.onnx:MatMul:Input B should be constant.
     ai.onnx:Pow:Only supports cases when both inputs are fp32.
     ai.onnx:Shape:Attribute `start` with non-default value is not supported. Attribute `end` is not supported.
     ai.onnx:Slice:Inputs `starts`, `ends`, `axes`, and `steps` should be constant. Empty slice is not supported.
INFO:  CoreML is not recommended with this model as there are 104 partitions covering 57.8% of the nodes in the model. This will most likely result in worse performance than just using the CPU EP.
INFO:  Model should perform well with CoreML as is: NO
INFO:  Checking if model will perform better if the dynamic shapes are fixed...
INFO:  Partition information if the model was updated to make the shapes fixed:
INFO:  89 partitions with a total of 577/758 nodes can be handled by the CoreML EP.
INFO:  Partition sizes: [4, 6, 16, 13, 6, 15, 10, 3, 6, 5, 5, 2, 5, 13, 5, 2, 5, 15, 10, 3, 6, 5, 5, 2, 5, 13, 5, 2, 5, 15, 10, 3, 6, 5, 5, 2, 5, 13, 5, 2, 5, 15, 10, 3, 6, 5, 5, 2, 5, 13, 5, 2, 5, 15, 10, 3, 6, 5, 5, 2, 5, 13, 5, 2, 5, 15, 10, 3, 6, 5, 5, 2, 5, 13, 5, 2, 5, 15, 10, 3, 6, 5, 5, 2, 5, 13, 5, 2, 5]
INFO:  Unsupported nodes due to operator=181
INFO:  Unsupported ops: ai.onnx:ConstantOfShape,ai.onnx:CumSum,ai.onnx:DequantizeLinear,ai.onnx:DynamicQuantizeLinear,ai.onnx:Equal,ai.onnx:Expand,ai.onnx:Less,ai.onnx:MatMulInteger,ai.onnx:Not,ai.onnx:Or,ai.onnx:Range,ai.onnx:ReduceMean,ai.onnx:ScatterND,ai.onnx:Softmax,ai.onnx:Unsqueeze,ai.onnx:Where
DEBUG:  Caveats that have not been checked and may result in a node not being supported:  
     ai.onnx:Gather:Input `indices` with scalar value is not supported.
     ai.onnx:MatMul:Input B should be constant.
     ai.onnx:Pow:Only supports cases when both inputs are fp32.
     ai.onnx:Shape:Attribute `start` with non-default value is not supported. Attribute `end` is not supported.
     ai.onnx:Slice:Inputs `starts`, `ends`, `axes`, and `steps` should be constant. Empty slice is not supported.
INFO:  CoreML is not recommended with this model as there are 89 partitions covering 76.1% of the nodes in the model. This will most likely result in worse performance than just using the CPU EP.
INFO:  Model should perform well with CoreML if modified to have fixed input shapes: NO
INFO:  ---------------
INFO:  Checking if pre-built ORT Mobile package can be used with /home/junchen/LinguaLinked/onnx_model/backup/bloom3b_quantized_int8_res/module0/module_0_quant.onnx once model is converted from ONNX to ORT format using onnxruntime.tools.convert_onnx_models_to_ort...
DEBUG:  Checking if the data types and operators used in the model are supported in the pre-built ORT package...
INFO:  Model should work with the pre-built package.
INFO:  ---------------

INFO:  Run `python -m onnxruntime.tools.convert_onnx_models_to_ort ...` to convert the ONNX model to ORT format. By default, the conversion tool will create an ORT format model with saved optimizations which can potentially be applied at runtime (with a .with_runtime_opt.ort file extension) for use with NNAPI or CoreML, and a fully optimized ORT format model (with a .ort file extension) for use with the CPU EP.
INFO:  For optimal performance the <model>.ort model should be used with the CPU EP. 

Based on the log, it's apparent that our current model is not optimally compatible with NNAPI or CoreML for hardware acceleration on Android devices. Despite our efforts to validate the model and partition the operations, a substantial number of nodes remain unsupported, and the model is divided into numerous partitions, which hampers performance. Furthermore, the presence of dynamic shapes and certain unsupported operators such as 'ConstantOfShape', 'CumSum', and 'MatMulInteger' further complicates the utilization of hardware acceleration.

To proceed, we are considering the following steps and would appreciate any guidance or suggestions:

  1. Model Optimization: We plan to revisit our model architecture and optimization strategies. Our goal is to minimize unsupported operations and dynamic shapes, as well as reduce the number of partitions when using NNAPI or CoreML. We would greatly benefit from any tips or best practices in optimizing models for these execution providers (see the sketch after this list for one way we are considering pinning the dynamic shapes).

  2. Alternative Execution Providers: Given the limitations we've encountered with NNAPI and CoreML, we are open to exploring other execution providers or acceleration frameworks that might be more compatible with our model and the ARM Mali-G710 MP7 GPU. If there are known providers or frameworks that have shown success with similar setups, we'd be keen to explore those.

  3. Custom Operators: For the operators that are not supported out-of-the-box by NNAPI or CoreML, is it feasible and advisable to implement custom operators? We understand this could be a complex endeavor, but it might be a necessary step to achieve the performance we desire.

  4. Direct GPU Inference: Bypassing high-level frameworks, is there a pathway to leverage the GPU more directly for inference tasks? We realize this might involve significant low-level programming and optimization, but if there are established approaches or libraries that can aid in this process, we would be interested in learning more.

  5. Model Partitioning Strategy: The current partitioning does not seem to be beneficial. Would manual partitioning or a different strategy for partitioning the model be more effective in optimizing performance with hardware acceleration?
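
For item 1 specifically, besides re-exporting the model, one option we are looking at is pinning the symbolic dimensions at session creation time (or offline with onnxruntime.tools.make_dynamic_shape_fixed). A minimal sketch via the C++ API, where the dimension names batch_size and sequence_length are placeholders for whatever names our exported graph actually uses:

    #include <onnxruntime_cxx_api.h>

    // Sketch: override free (symbolic) dimensions with fixed values so that
    // fewer nodes are rejected for having dynamic shapes. The dimension names
    // below are placeholders and must match the names in the exported ONNX graph.
    Ort::SessionOptions MakeFixedShapeSessionOptions() {
      Ort::SessionOptions so;
      so.AddFreeDimensionOverrideByName("batch_size", 1);
      so.AddFreeDimensionOverrideByName("sequence_length", 128);
      return so;
    }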

Thanks!

To reproduce

N/A.

Urgency

Not very urgent.

Platform

Android

OS Version

13

ONNX Runtime Installation

Built from Source

Compiler Version (if 'Built from Source')

No response

Package Name (if 'Released Package')

onnxruntime-android

ONNX Runtime Version or Commit ID

1.15

ONNX Runtime API

C++/C

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

github-actions[bot] commented 8 months ago

This issue has been automatically closed due to inactivity. Please reactivate if further support is needed.

skottmckay commented 8 months ago

CoreML now supports dynamic input shapes, although using them may cost performance. NNAPI doesn't support dynamic input shapes. Neither has a really good story for LLMs currently; for mobile scenarios 4-bit quantization may be required, but neither supports it.

We add support for operators to CoreML/NNAPI on demand. Something like ConstantOfShape could probably be added; not sure if there's a CumSum equivalent though. Dynamic quantization in ONNX involves DynamicQuantizeLinear and MatMulInteger, which don't have direct equivalents in CoreML or NNAPI. We could try to implement those using lower-level operations, but it's not clear how much of the performance gain would be lost by having to do so.

The partitioning is simply trying to assign as many connected nodes as possible to the EP. Unsupported operators break those partitions up.

NNAPI is meant to choose the best option from the devices available to it, and we don't have much control over its logic. I assume it's picking the NPU because hitting the GPU hard could make the device's UI unresponsive. With NNAPI feature level 3 it may (untested) be possible to filter the devices we allow NNAPI to use via ANeuralNetworksCompilation_createForDevices and exclude the NPU that way.
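To illustrate what that could look like (this is not something the NNAPI EP exposes today, so it is only a sketch that would have to be wired into a modified EP, with result-code checks omitted):

    #include <android/NeuralNetworks.h>

    // Sketch only: pick GPU-type NNAPI devices and compile against just those,
    // excluding NPU/accelerator devices. Requires NNAPI feature level 3
    // (API level 29+) and would need to live inside a modified NNAPI EP.
    int CompileOnGpuDevicesOnly(ANeuralNetworksModel* model,
                                ANeuralNetworksCompilation** compilation) {
      uint32_t device_count = 0;
      ANeuralNetworks_getDeviceCount(&device_count);

      const ANeuralNetworksDevice* gpu_devices[16];
      uint32_t num_gpu = 0;
      for (uint32_t i = 0; i < device_count && num_gpu < 16; ++i) {
        ANeuralNetworksDevice* device = nullptr;
        ANeuralNetworks_getDevice(i, &device);
        int32_t type = ANEURALNETWORKS_DEVICE_UNKNOWN;
        ANeuralNetworksDevice_getType(device, &type);
        if (type == ANEURALNETWORKS_DEVICE_GPU) {
          gpu_devices[num_gpu++] = device;
        }
      }
      if (num_gpu == 0) return ANEURALNETWORKS_BAD_DATA;

      return ANeuralNetworksCompilation_createForDevices(
          model, gpu_devices, num_gpu, compilation);
    }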

We don't have any other GPU-specific EPs for mobile currently. There's a large cost to creating a new EP and the benefit is limited, as it would be Android-only or iOS-only (AFAIK there's no GPU framework available on both), and if executing on the GPU renders the device unresponsive, using the GPU may not be a viable approach for a real app.

We do have the QNN EP, which can potentially be used on Android, but it is specific to a subset of Qualcomm chips and IIRC does not currently support dynamic input shapes. I think the QNN libraries for Android were around 20 MB, so there's a significant hit to the app binary size.
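For reference, on ORT builds that include it, the QNN EP is registered through the generic provider-options API. A rough sketch, where the backend library name and the exact option keys depend on the ORT/QNN release and should be verified:

    #include <onnxruntime_cxx_api.h>

    // Sketch: register the QNN EP on a build of ORT that includes it.
    // "backend_path" points at the Qualcomm backend library shipped with the
    // app; the library name shown here is illustrative and version-dependent.
    Ort::SessionOptions MakeQnnSessionOptions() {
      Ort::SessionOptions so;
      so.AppendExecutionProvider("QNN", {{"backend_path", "libQnnHtp.so"}});
      return so;
    }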

You can implement custom ops, but they will break up the CoreML/NNAPI partitions, so there will be a perf cost from going between the CPU and those EPs if you are trying to use both.
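To give an idea of what a custom op involves with the C++ API, here is a stripped-down sketch of the registration plumbing using a made-up element-wise kernel; real replacements for ops like MatMulInteger would be considerably more work:

    #include <onnxruntime_cxx_api.h>
    #include <vector>

    // Sketch of the custom-op plumbing with a trivial float "AddOne" kernel.
    // Any op implemented this way runs on the CPU and therefore splits the
    // NNAPI/CoreML partitions around it.
    struct AddOneKernel {
      void Compute(OrtKernelContext* context) {
        Ort::KernelContext ctx(context);
        auto input = ctx.GetInput(0);
        auto shape = input.GetTensorTypeAndShapeInfo().GetShape();
        auto output = ctx.GetOutput(0, shape);
        const float* x = input.GetTensorData<float>();
        float* y = output.GetTensorMutableData<float>();
        size_t n = input.GetTensorTypeAndShapeInfo().GetElementCount();
        for (size_t i = 0; i < n; ++i) y[i] = x[i] + 1.0f;
      }
    };

    struct AddOneOp : Ort::CustomOpBase<AddOneOp, AddOneKernel> {
      void* CreateKernel(const OrtApi& /*api*/, const OrtKernelInfo* /*info*/) const {
        return new AddOneKernel();
      }
      const char* GetName() const { return "AddOne"; }
      const char* GetExecutionProviderType() const { return "CPUExecutionProvider"; }
      size_t GetInputTypeCount() const { return 1; }
      ONNXTensorElementDataType GetInputType(size_t) const {
        return ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT;
      }
      size_t GetOutputTypeCount() const { return 1; }
      ONNXTensorElementDataType GetOutputType(size_t) const {
        return ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT;
      }
    };

    // Registration: the domain name ("my.custom" here) must match the domain
    // used by the node in the model. The op and domain must outlive the session.
    void RegisterCustomOps(Ort::SessionOptions& so) {
      static AddOneOp add_one_op;
      static Ort::CustomOpDomain domain("my.custom");
      domain.Add(&add_one_op);
      so.Add(domain);
    }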