Open salvadog opened 6 days ago
@salvadog is ONNX using CUDA or CPU? ExecuTorch is designed for mobile deployment and does not have a CUDA backend.
Yeah I know, ONNX is also using the CPU. I ran a 300M ViT model on Android with an 8 x 3 x 448 x 448 input, and the inference latency is quite high, about 200 s. That is much slower than Llama-2B on Android (TTFT 0.5 s + 30 tokens/s), and also much slower than ViT ONNX inference on Android. Both Llama-2B and the ViT are running with the XNNPACK backend.
Some things to check: make sure ExecuTorch is built in release mode, and check how much of the model graph is lowered to XNNPACK vs. running the portable ops in ExecuTorch (print the graph after running to_backend).
Another thing to call out: because ExecuTorch is focused on mobile, we usually see better performance on ARM CPUs than on x86.
cc @digantdesai for ExecuTorch vs. ONNX perf issues with XNNPACK
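For reference, a minimal sketch of that check (illustrative only, not the actual aot_compiler code: TinyBlock is a toy stand-in for the real model, and the partitioner import path may differ slightly between ExecuTorch versions):

from collections import Counter

import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge

class TinyBlock(torch.nn.Module):
    # Toy stand-in with the op mix seen in a ViT block (layer_norm, linear, gelu).
    def __init__(self, dim: int = 64):
        super().__init__()
        self.norm = torch.nn.LayerNorm(dim)
        self.fc1 = torch.nn.Linear(dim, dim * 4)
        self.fc2 = torch.nn.Linear(dim * 4, dim)

    def forward(self, x):
        h = self.norm(x)
        h = torch.nn.functional.gelu(self.fc1(h))
        return x + self.fc2(h)

model = TinyBlock().eval()
example_inputs = (torch.randn(1, 16, 64),)

edge = to_edge(torch.export.export(model, example_inputs))
edge = edge.to_backend(XnnpackPartitioner())

# Count what is left in the top-level graph after delegation: anything that is not an
# executorch_call_delegate node will run on ExecuTorch's own (portable/optimized) kernels.
graph = edge.exported_program().graph_module.graph
counts = Counter(str(n.target) for n in graph.nodes if n.op == "call_function")
for target, count in counts.most_common():
    print(f"{count:4d}  {target}")

The same counting can also be done on the printed export log, as shown further down in this thread.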
I've made sure ExecuTorch is built in release mode. My main concern is that inference speed is good for ExecuTorch Llama-2-2B on Android, but quite slow for the ViT under a similar export method and settings. Is this expected behavior, or is something going wrong? @metascroy @digantdesai
Thanks @salvadog for trying this out. And I am glad Llama is running with decent perf for you on the Android phone.
ONNX inference time on Linux PC: 12 s
ViT ExecuTorch inference time on Linux PC: 450 s
ViT ExecuTorch inference time on Android: 200 s
This is not what I would expect. I guess some operators could be running on the reference (also slow) implementation and not on XNNPACK.
check how much of the model graph is lowered to XNNPACK vs. running the portable ops in ExecuTorch (print the graph after running to_backend).
As @metascroy suggested, can we try this?
Thanks for helping out! My export commands are
Llama:
python -m examples.models.llama2.export_llama --checkpoint /XXX/checkpoint.pth \
  -p /XXX/config.json \
  -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 \
  --metadata '{"get_bos_id":1, "get_eos_id":2}' \
  --embedding-quantize 4,32 --output_name="internlm2_2B_kv_sdpa_xnn_qe_4_32.pte"
ViT:
python -m examples.xnnpack.aot_compiler --model_name="internvit" --delegate --quantize
I've attached the Llama and ViT export logs; the ViT log is quite long, so I only attached the beginning and ending parts. I didn't see any information about the model graph in the ViT log. Could you tell me how to modify the code to
check how much of the model graph is lowered to XNNPACK vs. running the portable ops in ExecuTorch (print the graph after running to_backend).
And let me know if any other information is needed.
Thanks a ton for sharing the output. The ViT text file vit_export_log.txt does contain the exported graph with delegation.
So, looking at the graph post-delegation:
# line 336 where the export graph with delegate starts in your file.
$ awk 'NR > 336' vit_export_log.txt \
| grep -o "call_function\[target=.*\](" \
| sed -r "s/call_function\[target=(.*)\]\(/\1/g" \
| sort -h | uniq -c | sort -n
1 executorch.exir.dialects.edge._ops.aten.select_copy.int
6 executorch.exir.dialects.edge._ops.aten.gelu.default
11 executorch.exir.dialects.edge._ops.aten.native_layer_norm.default
12 executorch.exir.dialects.edge._ops.aten.bmm.default
16 executorch.exir.dialects.edge._ops.aten.squeeze_copy.dims
18 executorch.exir.dialects.edge._ops.aten.clone.default
24 executorch.exir.dialects.edge._ops.aten.expand_copy.default
47 executorch.exir.dialects.edge._ops.aten.view_copy.default
52 torch.ops.higher_order.executorch_call_delegate # these lower to XNNPACK
96 operator.getitem
So a bunch of operators from the ViT graph are running outside XNNPACK. In ET they can run on either the optimized library or the portable library. The portable implementations of bmm or gelu can be slow.
You can validate this by doing something like
adb shell "cd /data/local/tmp; simpleperf record xnn_executor_runner_android --model_path ./vit/internvit_xnnpack_q8.pte && simpleperf report" | less
And skimming the CMake file, it seems like we may not be using the optimized library optimized_ops_lib for xnn_executor_runner.
Thank you so much for your invaluable help, @digantdesai! I've included the output from the Android ET runner and the simpleperf report. It currently takes 55 seconds to process a tensor of shape [8,3,448,448] with a 300M ViT model. The simpleperf report indicates that the BMM operation is not leveraging XNNPACK and is responsible for 70% of the total time expenditure.
Does this imply that if we were to optimize the BMM operation with XNNPACK, we could potentially reduce the total time to about 55 s x 0.3 ≈ 16.5 s? Even so, this would still be a significant amount of time. I'm curious about the expected performance for executing a ViT model of this scale with ET, and whether there are any benchmarks or examples I could use for reference. Additionally, I am eager to explore optimization strategies to achieve my ideal running speed of 1 second. Is this goal attainable, and if so, what steps should I take to optimize the performance further?
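As an aside, a quick Amdahl-style sketch of that estimate (assuming the 70% BMM share reported by simpleperf is accurate and nothing else changes; the speedup factors below are illustrative, not measured):

TOTAL_S = 55.0      # measured end-to-end latency on Android (from this thread)
BMM_SHARE = 0.70    # fraction of time in the portable bmm kernel (from simpleperf)

def estimated_latency(bmm_speedup: float) -> float:
    # End-to-end latency if only the bmm portion becomes `bmm_speedup`x faster.
    return TOTAL_S * (1.0 - BMM_SHARE) + TOTAL_S * BMM_SHARE / bmm_speedup

print(estimated_latency(float("inf")))  # ~16.5 s lower bound (bmm cost goes to zero)
print(estimated_latency(10.0))          # ~20.4 s with a 10x faster bmm
print(estimated_latency(5.0))           # ~24.2 s with a 5x faster bmm

Even with bmm free, the lower bound here is about 16.5 s, so reaching a 1 s target would also require the remaining non-delegated ops (layer_norm, gelu, the view/expand copies listed above) to run faster or be delegated as well.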
🐛 Describe the bug
I've encountered a performance issue where ExecuTorch's inference speed is significantly slower than ONNX's, both on a Linux PC and on an Android phone. I believe this is a critical issue that needs to be addressed, as it affects the efficiency of our model deployment.
Environment:
onnx==1.17.0
onnxruntime==1.20.0
executorch==0.3.0
torch==2.4.0+cu121
python=3.10.15
Linux PC hardware: NVIDIA A100 80GB, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
Android phone hardware: Qualcomm Snapdragon 8+ Gen 1
Reproduction Steps:
The ViT is an InternViT-300M model, with a 7 x 3 x 448 x 448 input size.
I export the ViT model with:
python -m examples.xnnpack.aot_compiler --model_name="internvit" --delegate --quantize
And run inference on the Linux PC with:
./cmake-out/backends/xnnpack/xnn_executor_runner --model_path=./internvit_xnnpack_q8.pte
And on Android with:
adb shell ./data/local/tmp/vit/xnn_executor_runner_android --model_path /data/local/tmp/vit/internvit_xnnpack_q8.pte
Expected Behavior:
I'm not very familiar with typical inference times for ONNX and ExecuTorch, but I expected them to be within an acceptable margin of each other. I've already exported a Llama-2-2B model, which runs at a reasonable speed (TTFT 0.5 s + 30 tokens/s) on my Android phone, so I expected the ViT-300M inference speed to be somewhat similar.
Actual Behavior:
ONNX inference time on Linux PC: 12 s
ViT ExecuTorch inference time on Linux PC: 450 s
ViT ExecuTorch inference time on Android: 200 s
Questions:
Is there a known performance gap between ExecuTorch and ONNX? Are there any optimization techniques or configurations that can improve the ViT's ExecuTorch performance? I would appreciate any guidance on how to resolve this performance discrepancy. Thank you for your attention to this issue.
Versions
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (conda-forge gcc 13.3.0-1) 13.3.0
Clang version: Could not collect
CMake version: version 3.30.3
Libc version: glibc-2.31

Python version: 3.10.15 | packaged by conda-forge | (main, Sep 20 2024, 16:37:05) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-4.15.0-191-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.6.68
CUDA_MODULE_LOADING set to: LAZY