
Converting ONNX to ORT with NNAPI support #9054

Closed IdoZach closed 3 years ago

IdoZach commented 3 years ago

Hi, I converted an ONNX model to ORT (on Linux, with the onnxruntime 1.8.2 Python package) and then used it in an Android application with libonnxruntime.so compiled with --use_nnapi, enabling the NNAPI option:

SessionOptions options = new SessionOptions();
options.addNnapi();

It was considerably slower than running on CPU without the addNnapi() option above. I thought the issue might be that I converted the ONNX model to ORT without awareness of NNAPI, so I tried compiling onnxruntime with --build_wheel --use_nnapi and used that Python package for the conversion, but the results were identical.

When running, I get this warning on the Android debugger:

W/onnxruntime:  [W:onnxruntime:ort-java, nnapi_execution_provider.cc:178 GetCapability] NnapiExecutionProvider::GetCapability, number of partitions supported by NNAPI: 17 number of nodes in the graph: 87 number of nodes supported by NNAPI: 53

My network contains the following layers:

ai.onnx;6;Relu
ai.onnx;7;Add
ai.onnx;11;Conv,MaxPool,ReduceMean
com.microsoft;1;FusedConv

How can I get an inference speedup over CPU instead of the current state? Thanks!

manashgoswami commented 3 years ago

The 17 partitions for NNAPI are not optimal for the execution flow; that partitioning is the overhead of NNAPI compared to CPU execution. Can you share any more details about the model? Is there an open-source reference you can point to?

guoyu-wang commented 3 years ago

The NNAPI EP does not yet support ReduceMean and FusedConv; this may be the reason the model is partitioned into small fragments. Please share the model if possible.

IdoZach commented 3 years ago

Hi, the model is prepared with PyTorch using the first layers of torchvision's ResNet50 and an additional mean at the end (not shown). The FusedConv layer is created by the ORT converter - how can I prevent this from happening? In the ONNX model itself it is a standard Conv.

Example code for preparing the model:

import torch
import torchvision

# First layers of ResNet50, reused as a feature extractor
resnet = torchvision.models.resnet50(pretrained=True)
seq = torch.nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4)
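
For completeness, a minimal export sketch continuing from the snippet above (the file name and dummy input are assumptions; the static 3x128x128 input and opset 11 match what is described later in this thread):

seq.eval()
# Static NCHW input shape, no dynamic axes, matching the fixed 3x128x128 size used in the experiments
dummy_input = torch.randn(1, 3, 128, 128)
torch.onnx.export(seq, dummy_input, "model.onnx", opset_version=11)
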
guoyu-wang commented 3 years ago

The ONNX model produced by seq in your code snippet is entirely supported by NNAPI. Try using --optimization_level=basic when converting the ONNX model to the .ort format.
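
For reference, this is roughly how that conversion can be run (a sketch only, assuming the onnxruntime.tools.convert_onnx_models_to_ort script shipped with the 1.8.x Python package and a hypothetical model.onnx path):

import subprocess, sys

# Convert to ORT format with only basic (hardware-independent) optimizations,
# so that EP-specific fused ops such as FusedConv are not baked into the model.
subprocess.run(
    [sys.executable, "-m", "onnxruntime.tools.convert_onnx_models_to_ort",
     "model.onnx", "--optimization_level=basic"],
    check=True,
)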

IdoZach commented 3 years ago

Thanks, I did that and the model now runs on the GPU with NNAPI. However, it now runs at about the same speed as CPU-only; I tried both without flags and with NNAPIFlags.CPU_DISABLED. How can I get a speedup over the CPU? Is there a profiler or a more verbose run mode to check what is being executed on the phone's CPU and GPU?

guoyu-wang commented 3 years ago

The ONNX model produced by seq in your code snippet runs about 50% faster with NNAPI than on CPU (230ms vs 450ms, 4 cores) on a Google Pixel 3a. The result is highly dependent on the hardware and the model. Make sure your model uses a static input shape, and please share the ONNX model if possible.

IdoZach commented 3 years ago

The model: https://github.com/IdoZach/onnx/blob/master/model.onnx.gz. It was converted with opset 11.
I use a fixed input size of 3x128x128 for these experiments.

guoyu-wang commented 3 years ago

If you use all the cores on your phone, NNAPI probably won't be faster than the CPU, especially on a modern mobile CPU. What NNAPI does better here is offload the computation to the GPU or DSP, which gives better system responsiveness with almost no CPU usage and much lower power consumption.

Another way to improve NNAPI performance is to statically quantize the model. See https://fs-eire.github.io/onnxruntime/docs/how-to/quantization.html
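
For reference, a minimal static quantization sketch (a sketch only, not the exact procedure from the linked docs; the file names, the input name "input", and the random calibration data are assumptions, and a real calibration reader should feed representative preprocessed images):

import numpy as np
from onnxruntime.quantization import CalibrationDataReader, quantize_static

class RandomCalibrationReader(CalibrationDataReader):
    # Feeds a handful of dummy calibration samples; replace with real data.
    def __init__(self, input_name="input", num_samples=16):
        self._samples = iter(
            [{input_name: np.random.rand(1, 3, 128, 128).astype(np.float32)}
             for _ in range(num_samples)]
        )

    def get_next(self):
        return next(self._samples, None)

# Static quantization calibrates activation ranges ahead of time, producing
# statically quantized nodes (e.g. QLinearConv with QuantizeLinear/DequantizeLinear),
# rather than the DynamicQuantizeLinear/ConvInteger nodes inserted by dynamic quantization.
quantize_static("model.onnx", "model_quant.onnx", RandomCalibrationReader())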

IdoZach commented 3 years ago

Thanks. I tried to quantize with static quantization, but in this case the layers become more complicated (see the image for part of the graph), and after rebuilding the shared libraries for the new config (see below), I am back to a heavily partitioned NNAPI execution:

 number of partitions supported by NNAPI: 49 number of nodes in the graph: 380 number of nodes supported by NNAPI: 225

And the run does not even complete; it crashes:

W/System.err: ai.onnxruntime.OrtException: Error code - ORT_FAIL - message: inference_session.cc:1217 AssignNodesToEpsFromHashesImpl Failed to find kernel def hash in kernel registries for Cast(9) node with name '647_output_quantized_cast'.

config:

ai.onnx;6;Relu
ai.onnx;7;Add,Mul
ai.onnx;9;Cast
ai.onnx;10;ConvInteger
ai.onnx;11;DynamicQuantizeLinear,MaxPool

Graph: [image showing part of the quantized graph]

guoyu-wang commented 3 years ago

The operator list above (DynamicQuantizeLinear, ConvInteger) indicates that dynamic quantization was used. Please use static quantization.

IdoZach commented 3 years ago

Thanks, I had used the wrong quantization. With static quantization, latency is now improved by about 35-40%. Are there more options to speed up inference on the GPU?

guoyu-wang commented 3 years ago

This has probably already reached the limit of your hardware. Please share the quantized ONNX model and your phone specs.

IdoZach commented 3 years ago

Quantized model: https://github.com/IdoZach/onnx/blob/master/model_quant.onnx.gz, Pixel 4a.

guoyu-wang commented 3 years ago

The only improvement that can be made is to run this MaxPool in uint8 instead of float, so that the DequantizeLinear before it and the QuantizeLinear after it can be removed and the whole graph can run on the DSP instead of switching between GPU and DSP. This could be done during quantization.

Please open a separate issue on the quantization tools about this. Please include the float ONNX model, the quantized model, and your quantization code snippet in the new issue.

[image: graph excerpt showing the DequantizeLinear -> MaxPool -> QuantizeLinear sequence]