The 17 partitions for NNAPI are not optimal for the execution flow; that is where the NNAPI overhead comes from compared to CPU execution. Can you share any more details about the model? Is there any open-source reference you can point to?
The NNAPI EP does not yet support ReduceMean and FusedConv; this may be the reason the model is partitioned into small fragments. Please share the model if possible.
Hi, the model is prepared with PyTorch by using the first layers of torchvision's ResNet50 and an additional mean at the end (not shown). The FusedConv layer is created by the ORT converter, so how can I prevent this from happening? In the ONNX itself it's a standard Conv.
Example code for preparing the model:
import torch
import torchvision

resnet = torchvision.models.resnet50(pretrained=True)
seq = torch.nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
                          resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4)
The ONNX model output of seq from your code snippet is entirely supported by NNAPI. Try using --optimization_level=basic when you convert the ONNX model to .ort format.
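For reference, the conversion is typically run through the bundled script along these lines (a sketch; the model path is a placeholder and the flag spelling may differ between ORT versions):

# hypothetical invocation; adjust the path and flag syntax to your ORT version
python -m onnxruntime.tools.convert_onnx_models_to_ort model.onnx --optimization_level basic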
Thanks, I did that and the model now runs on the GPU with NNAPI. However, it now runs at about the same speed as CPU-only; I tried both without flags and with NNAPIFlags.CPU_DISABLED. How can I get a speed gain over the CPU? Is there a profiler or a more verbose run mode to check what is being done on the phone's CPU and GPU?
The ONNX model output of seq from your code snippet can run about 50% faster (230 ms vs 450 ms) using NNAPI vs CPU (4 cores) on a Google Pixel 3a. The result may be highly dependent on the hardware and the model. Make sure your model uses a static input shape, and please share the ONNX model if possible.
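In case it helps, here is a minimal export sketch for the seq model above with a fixed input shape (no dynamic_axes); the 1x3x128x128 dummy input and the "input"/"output" tensor names are assumptions to adjust to your setup:

import torch

# fixed-size dummy input and no dynamic_axes, so the exported graph has a static input shape
dummy = torch.randn(1, 3, 128, 128)
torch.onnx.export(seq, dummy, "model.onnx",
                  opset_version=11,
                  input_names=["input"], output_names=["output"])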
The model: https://github.com/IdoZach/onnx/blob/master/model.onnx.gz (converted with opset 11). I use a fixed input size of 3x128x128 for these experiments.
If you use all cores on your phone, NNAPI probably won't be faster than the CPU, especially with a modern mobile CPU. What NNAPI can do better here is offload the computation to the GPU or DSP, which gives better system responsiveness with almost no CPU usage and much lower power consumption.
Another way to improve NNAPI performance is to statically quantize the model. See https://fs-eire.github.io/onnxruntime/docs/how-to/quantization.html
Thanks. I tried to quantize with static quantization, but in this case the layers become more complicated (see image for part of the graph), and when rebuilding the shared libraries for the new config (see below), I get back to the erroneous NNAPI execution:
number of partitions supported by NNAPI: 49
number of nodes in the graph: 380
number of nodes supported by NNAPI: 225
And it does not even complete the run but crashes:
W/System.err: ai.onnxruntime.OrtException: Error code - ORT_FAIL - message: inference_session.cc:1217 AssignNodesToEpsFromHashesImpl Failed to find kernel def hash in kernel registries for Cast(9) node with name '647_output_quantized_cast'.
config:
ai.onnx;6;Relu
ai.onnx;7;Add,Mul
ai.onnx;9;Cast
ai.onnx;10;ConvInteger
ai.onnx;11;DynamicQuantizeLinear,MaxPool
Graph: [image of part of the quantized graph]
Please use static quantization.
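For what it's worth, here is a minimal static-quantization sketch (assuming the package-level onnxruntime.quantization API, an input tensor named "input", and random stand-in calibration data; swap in representative images for real use):

import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a handful of samples to the calibrator; replace the random
    tensors with representative real inputs to get usable quantization ranges."""
    def __init__(self, input_name="input", num_samples=32):
        self._samples = ({input_name: np.random.rand(1, 3, 128, 128).astype(np.float32)}
                         for _ in range(num_samples))

    def get_next(self):
        return next(self._samples, None)

# uint8 activations and weights tend to map well onto NNAPI's quantized ops
quantize_static("model.onnx", "model_quant.onnx", RandomCalibrationReader(),
                activation_type=QuantType.QUInt8, weight_type=QuantType.QUInt8)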
Thanks, I had the wrong quantization. Now the latency is improved by about 35-40%. Are there more options to speed up the inference time on GPU?
This has probably already reached the limit of your hardware. Please share the quantized ONNX model and your phone specs.
Quantized model: https://github.com/IdoZach/onnx/blob/master/model_quant.onnx.gz, Pixel 4a.
The only improvement that can be made is to make this MaxPool run in uint8 instead of float, so that the DequantizeLinear before it and the QuantizeLinear after it can be removed; then the whole graph can run on the DSP instead of switching between GPU and DSP. This could be done in the quantization process.
Please open a separate issue on the Quantization tools about this. Please include the float ONNX model, the quantized model, and your quantization code snippet in the new issue.
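One possible direction for that follow-up issue (an assumption, not verified on this model): the float MaxPool may simply be a consequence of the opset 11 export, since ONNX only allows 8-bit MaxPool from opset 12 onward, so re-exporting at a higher opset and re-running static quantization might let the quantizer keep MaxPool in uint8. A rough sketch:

import torch

# re-export at opset >= 12 so the quantizer is allowed to keep MaxPool in uint8,
# then re-run quantize_static on the new model as in the earlier sketch
dummy = torch.randn(1, 3, 128, 128)
torch.onnx.export(seq, dummy, "model_opset12.onnx", opset_version=12,
                  input_names=["input"], output_names=["output"])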
Hi, I converted an ONNX model to ORT (on Linux with the onnxruntime 1.8.2 Python package), and then used it in an Android application with libonnxruntime.so compiled with --use_nnapi, using the NNAPI option:

It was considerably slower than running on CPU without the addNnapi() option above. I thought that maybe the issue was that I had converted the ONNX to ORT without awareness of NNAPI, so I tried to compile onnxruntime with --build_wheel --use_nnapi and used that Python package to convert, but the results were identical.

When running, I get this warning on the Android debugger:

My network contains the following layers:

How can I get an inference-speed improvement over the CPU instead of the current state? Thanks!