Closed: zjc664656505 closed this issue 8 months ago
This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
This issue has been automatically closed due to inactivity. Please reactivate if further support is needed.
CoreML now supports dynamic input shapes, although using them may come at a performance cost. NNAPI doesn't support dynamic input shapes at all. Neither has a really good story for LLMs currently, and for mobile scenarios 4-bit quantization may be required, but neither supports it.
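For context, the 4-bit weight quantization that LLM runtimes typically rely on looks roughly like the sketch below: each block of weights shares one floating-point scale, and two signed 4-bit values pack into a byte. This is illustrative pure Python only, not ONNX Runtime, CoreML, or NNAPI code, and the function names are made up:

```python
# Illustrative sketch of symmetric blockwise 4-bit weight quantization.
# Each block of weights shares one fp scale; two signed 4-bit values
# ([-7, 7]) pack into one byte as high/low nibbles.

def quantize_block_int4(block):
    """Quantize a list of floats to signed 4-bit ints plus one scale."""
    scale = max(abs(v) for v in block) / 7.0 or 1.0  # avoid scale == 0
    q = [max(-7, min(7, round(v / scale))) for v in block]
    return q, scale

def pack_int4(q):
    """Pack pairs of signed 4-bit values into bytes (two nibbles per byte)."""
    if len(q) % 2:
        q = q + [0]  # pad odd-length blocks
    return bytes(((a & 0xF) << 4) | (b & 0xF) for a, b in zip(q[::2], q[1::2]))

def dequantize_block_int4(q, scale):
    """Recover approximate float weights from the quantized block."""
    return [v * scale for v in q]
```

The packed form halves weight storage relative to int8, which is why it matters for fitting LLM weights into mobile memory budgets; the cost is coarser precision per block.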
We add support for operators to CoreML/NNAPI on demand. Something like ConstantOfShape could probably be added. Not sure if there's a CumSum equivalent though. Dynamic quantization for ONNX involves DynamicQuantizeLinear and MatMulInteger, which don't have direct equivalents in CoreML or NNAPI. We could try to implement those using lower-level operations, but it's not clear how much of the potential performance gain would be lost by doing so.
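As a rough sketch of what would need to be reproduced from lower-level operations, this is the per-tensor computation DynamicQuantizeLinear performs according to the ONNX operator spec (pure Python, illustrative only; MatMulInteger would then multiply the resulting uint8 tensors):

```python
# Sketch of ONNX DynamicQuantizeLinear: derive a uint8 scale/zero-point
# from the runtime min/max of the input, then quantize. The range is
# widened to include 0 so that 0.0 is exactly representable.

def dynamic_quantize_linear(x):
    qmin, qmax = 0, 255                  # uint8 output range
    rmin = min(0.0, min(x))              # range must include zero
    rmax = max(0.0, max(x))
    scale = (rmax - rmin) / (qmax - qmin) or 1.0
    zero_point = int(max(qmin, min(qmax, round(qmin - rmin / scale))))
    y = [int(max(qmin, min(qmax, round(v / scale) + zero_point))) for v in x]
    return y, scale, zero_point
```

Because the scale and zero point depend on the runtime min/max of each activation tensor, the whole reduction-plus-quantize sequence has to run per inference, which is exactly the part that is awkward to express with the fixed-function ops CoreML and NNAPI expose.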
The partitioning simply tries to assign as many connected nodes as possible to the EP; unsupported operators break those partitions up.
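As a toy illustration of that behaviour (a hypothetical helper, treating the graph as a linear chain of ops for simplicity; the real partitioner works on connected components of the graph):

```python
# Every maximal run of EP-supported ops becomes one partition; each
# unsupported op forces a hand-off back to the CPU EP, starting a new
# partition when supported ops resume.

def count_partitions(ops, supported):
    partitions, in_partition = 0, False
    for op in ops:
        if op in supported:
            if not in_partition:
                partitions += 1
                in_partition = True
        else:
            in_partition = False
    return partitions
```

Each extra partition adds a device/CPU transition at its boundaries, which is why a model whose log shows many small partitions can end up slower with the EP enabled than on CPU alone.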
NNAPI is meant to choose the best option from the devices available to it, and we don't have much control over its logic. I assume it's picking the NPU because hitting the GPU hard could make the device's UI unresponsive. With NNAPI feature level 3 it may (untested) be possible to filter the devices we allow NNAPI to use via ANeuralNetworksCompilation_createForDevices and exclude the NPU that way.
We don't have any other GPU-specific EPs for mobile currently. There's a large cost to creating a new EP, and the benefit is limited as it would be Android-only or iOS-only (AFAIK there's no GPU framework available on both). Moreover, if executing on the GPU renders the device unresponsive, using the GPU may not be a viable approach for a real app anyway.
We do have the QNN EP, and that can potentially be used on Android, but it is specific to a subset of Qualcomm chips, and IIRC it does not currently support dynamic input shapes. I think the QNN libraries for Android were around 20 MB, so there's a significant hit to the app binary size.
You can implement custom ops, but they will break up the CoreML/NNAPI partitions, so there will be a performance cost from transitioning between the CPU and those EPs if you are trying to use both.
Describe the issue
We are currently developing a system that involves deploying Large Language Models (LLMs) on Android smartphones. To date, we've managed to execute inference tasks using ONNX Runtime with the CPU Execution Provider, but the process is regrettably slow. Our goal is to leverage the built-in hardware accelerators, such as the GPU, to expedite the inference process. The specific GPU integrated into our Android devices is the ARM Mali-G710 MP7.
In an attempt to utilize the device's GPU, we've experimented with ONNX Runtime in conjunction with NNAPI. Unfortunately, it appears that NNAPI defaults to using the EdgeTPU for inference tasks, which is not currently supported by ONNX Runtime. Additionally, we've checked the compatibility of our model with ORT Mobile, NNAPI, and CoreML by validating it with the onnxruntime.tools.check_onnx_model_mobile_usability tool.
Below is the log corresponding to our attempts:
Based on the log, it's apparent that our current model is not optimally compatible with NNAPI or CoreML for hardware acceleration on Android devices. Despite our efforts to validate the model and partition the operations, a substantial number of nodes remain unsupported, and the model is divided into numerous partitions, which hampers performance. Furthermore, the presence of dynamic shapes and certain unsupported operators such as 'ConstantOfShape', 'CumSum', and 'MatMulInteger' further complicates the utilization of hardware acceleration.
To proceed, we are considering the following steps and would appreciate any guidance or suggestions:
Model Optimization: We plan to revisit our model architecture and optimization strategies. Our goal is to minimize unsupported operations and dynamic shapes, as well as reduce the number of partitions when using NNAPI or CoreML. We would greatly benefit from any tips or best practices in optimizing models for these execution providers.
Alternative Execution Providers: Given the limitations we've encountered with NNAPI and CoreML, we are open to exploring other execution providers or acceleration frameworks that might be more compatible with our model and the ARM Mali-G710 MP7 GPU. If there are known providers or frameworks that have shown success with similar setups, we'd be keen to explore those.
Custom Operators: For the operators that are not supported out-of-the-box by NNAPI or CoreML, is it feasible and advisable to implement custom operators? We understand this could be a complex endeavor, but it might be a necessary step to achieve the performance we desire.
Direct GPU Inference: Bypassing high-level frameworks, is there a pathway to leverage the GPU more directly for inference tasks? We realize this might involve significant low-level programming and optimization, but if there are established approaches or libraries that can aid in this process, we would be interested in learning more.
Model Partitioning Strategy: The current partitioning does not seem to be beneficial. Would manual partitioning or a different strategy for partitioning the model be more effective in optimizing performance with hardware acceleration?
Thanks!
To reproduce
N/A.
Urgency
Not very urgent.
Platform
Android
OS Version
13
ONNX Runtime Installation
Built from Source
Compiler Version (if 'Built from Source')
No response
Package Name (if 'Released Package')
onnxruntime-android
ONNX Runtime Version or Commit ID
1.15
ONNX Runtime API
C++/C
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response