pytorch / executorch

On-device AI across mobile, embedded and edge for PyTorch
https://pytorch.org/executorch/

QNN: failed to specify config SoC #3949

Closed zhouwg closed 3 months ago

zhouwg commented 3 months ago

I have a Qualcomm Snapdragon 8 Gen 3 Android phone and I am trying to run the QNN SDK (/opt/qcom/aistack/qairt/2.23.0.240531) on it. I found two strange issues:

1. Failed to specify config SoC

```
[qnn_init, 1114]: qualcomm soc_model:57(SM8650), htp_arch:75(QCOM_HTP_V75), vtcm_size:8 MB
[qnn_sdk_logcallback, 874]:      0.0ms [INFO   ]  <I> QnnDevice_freePlatformInfo started
[qnn_sdk_logcallback, 874]:      0.0ms [INFO   ]  <I> QnnDevice_freePlatformInfo done. status 0x0
[qnn_sdk_logcallback, 874]:      0.0ms [INFO   ]  <I> QnnDevice_create started
[qnn_sdk_logcallback, 874]:      0.0ms [VERBOSE]  <V> Create device with id 0x1
[qnn_sdk_logcallback, 874]:      0.0ms [VERBOSE]  <V> Validating device config 0x7fcdd70b10
[qnn_sdk_logcallback, 874]:      0.0ms [VERBOSE]  <V> Setting device config 0x7fcdd70b10 via router
[qnn_sdk_logcallback, 874]:      0.0ms [WARNING]  <W> Specified config SOC, ignoring on real target
[qnn_sdk_logcallback, 874]:      0.0ms [VERBOSE]  <V> Validating device config 0x7fcdd70af0
[qnn_sdk_logcallback, 874]:      0.0ms [VERBOSE]  <V> Setting device config 0x7fcdd70af0 via router
[qnn_sdk_logcallback, 874]:      0.0ms [WARNING]  <W> HTP arch will be deprecated, please set SoC id instead.
```
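For context, the two warnings above are typically emitted when both a SoC custom config and an HTP-arch custom config are passed at device-creation time. A minimal sketch of that setup, assuming the QNN SDK headers (QnnDevice.h, QnnHtpDevice.h); the struct and enum names follow the public QNN HTP API, but the surrounding wiring is illustrative, not my exact code:

```cpp
// Sketch only: requires the QNN SDK headers/libraries, not buildable standalone.
#include "QnnDevice.h"
#include "QnnHtpDevice.h"

static void create_htp_device(Qnn_LogHandle_t log_handle, Qnn_DeviceHandle_t *device) {
    // SoC custom config -- on a real target this is what triggers
    // "Specified config SOC, ignoring on real target".
    QnnHtpDevice_CustomConfig_t soc_cfg;
    soc_cfg.option   = QNN_HTP_DEVICE_CONFIG_OPTION_SOC;
    soc_cfg.socModel = 57;  // SM8650, as reported by qnn_init in the log above

    QnnDevice_Config_t soc_dev_cfg;
    soc_dev_cfg.option       = QNN_DEVICE_CONFIG_OPTION_CUSTOM;
    soc_dev_cfg.customConfig = &soc_cfg;

    // HTP-arch custom config -- this is what triggers
    // "HTP arch will be deprecated, please set SoC id instead."
    QnnHtpDevice_CustomConfig_t arch_cfg;
    arch_cfg.option        = QNN_HTP_DEVICE_CONFIG_OPTION_ARCH;
    arch_cfg.arch.arch     = QNN_HTP_DEVICE_ARCH_V75;
    arch_cfg.arch.deviceId = 0;

    QnnDevice_Config_t arch_dev_cfg;
    arch_dev_cfg.option       = QNN_DEVICE_CONFIG_OPTION_CUSTOM;
    arch_dev_cfg.customConfig = &arch_cfg;

    const QnnDevice_Config_t *dev_cfgs[] = {&soc_dev_cfg, &arch_dev_cfg, nullptr};
    // In practice QnnDevice_create is resolved through the QNN interface table:
    // qnn_interface.deviceCreate(log_handle, dev_cfgs, device);
    (void)dev_cfgs; (void)log_handle; (void)device;
}
```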
2. Why does a simple QNN_OP_ELEMENT_WISE_ADD operation cost 53 ms?
```
[ggml_qnn_add, 2163]: create qnn graph handle with graph name ggml_op_qnn_add_4_tensor_0_tensor_1 ok
[ggml_qnn_add, 2220]: alloc rpcmem successfully
[register_rpcmem, 1645]: mem_fd 24
[register_rpcmem, 1662]: tensor tensor_0000 successfully register shared memory
[ggml_qnn_add, 2229]: alloc rpcmem successfully
[register_rpcmem, 1645]: mem_fd 25
[register_rpcmem, 1662]: tensor tensor_0001 successfully register shared memory
[ggml_qnn_add, 2238]: alloc rpcmem successfully
[register_rpcmem, 1645]: mem_fd 26
[register_rpcmem, 1662]: tensor tensor_0002 successfully register shared memory
[info, 511]: duration of ggml_qnn_add : 53436 microseconds
[qnn_op_ut, 2037]: dump tensors:
[tensor_dump, 1404]: dump ggml tensor src0(tensor_0): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
    0.75    -0.35    -0.30     0.44
    0.19     0.66     0.11     0.17
    0.98     0.05    -0.93    -0.33
   -0.42     0.02    -0.93     0.56

[tensor_dump, 1404]: dump ggml tensor src1(tensor_1): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   -0.68    -0.87     0.09     0.49
    0.15    -0.56    -0.53    -0.78
   -0.72     0.31     0.14     0.43
    0.64    -0.53     0.07     0.59

[tensor_dump, 1404]: dump ggml tensor dst(tensor_2): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
    0.07    -1.22    -0.21     0.93
    0.34     0.10    -0.42    -0.61
    0.26     0.35    -0.80     0.11
```

Could any technical expert from QTI help explain why these happened? Thanks so much.

digantdesai commented 3 months ago

Did you try ExecuTorch delegate? https://github.com/pytorch/executorch/tree/main/backends/qualcomm

zhouwg commented 3 months ago

Thanks for your response.

Deploying ExecuTorch on Android is not an easy thing (it seems there are many dependencies for ExecuTorch on Android) and I haven't tried it yet. Could you help explain what special approach is used in ExecuTorch with the QNN API (I have read the QNN SDK reference manual many times)?

I'll try ExecuTorch on Android later. Thanks.

digantdesai commented 3 months ago

Check out ExecuTorch Android demo app which can leverage QNN - https://pytorch.org/executorch/stable/demo-apps-android.html

zhouwg commented 3 months ago

Thanks so much for your guidance. I'm trying ExecuTorch on Android but failed at the setup stage of ExecuTorch's dev environment, which was caused by the GFW:

```
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
fatal: clone of 'https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator.git' into submodule path '/home/weiguo/executorch/backends/vulkan/third-party/VulkanMemoryAllocator' failed
Failed to clone 'backends/vulkan/third-party/VulkanMemoryAllocator'. Retry scheduled
Cloning into '/home/weiguo/executorch/backends/xnnpack/third-party/XNNPACK'...
Cloning into '/home/weiguo/executorch/examples/models/llama2/third-party/abseil-cpp'...
Cloning into '/home/weiguo/executorch/examples/third-party/LLaVA'...
Cloning into '/home/weiguo/executorch/kernels/optimized/third-party/eigen'...
Cloning into '/home/weiguo/executorch/third-party/flatbuffers'...
Cloning into '/home/weiguo/executorch/third-party/flatcc'...
Cloning into '/home/weiguo/executorch/third-party/gflags'...
Cloning into '/home/weiguo/executorch/third-party/googletest'...
Cloning into '/home/weiguo/executorch/third-party/ios-cmake'...
Cloning into '/home/weiguo/executorch/third-party/prelude'...
Cloning into '/home/weiguo/executorch/third-party/pybind11'...
Cloning into '/home/weiguo/executorch/backends/vulkan/third-party/VulkanMemoryAllocator'...
Submodule path 'backends/arm/third-party/ethos-u-core-driver': checked out '90f9df900acdc0718ecd2dfdc53780664758dec5'
Submodule path 'backends/arm/third-party/serialization_lib': checked out '187af0d41fe75d08d2a7ec84c1b4d24b9b641ed2'
Submodule path 'backends/vulkan/third-party/Vulkan-Headers': checked out '0c5928795a66e93f65e5e68a36d8daa79a209dc2'
Submodule path 'backends/vulkan/third-party/VulkanMemoryAllocator': checked out 'a6bfc237255a6bac1513f7c1ebde6d8aed6b5191'
Submodule path 'backends/vulkan/third-party/volk': checked out 'b3bc21e584f97400b6884cb2a541a56c6a5ddba3'
Submodule path 'backends/xnnpack/third-party/FP16': checked out '4dfe081cf6bcd15db339cf2680b9281b8451eeb3'
Submodule path 'backends/xnnpack/third-party/FXdiv': checked out 'b408327ac2a15ec3e43352421954f5b1967701d1'
Submodule path 'backends/xnnpack/third-party/XNNPACK': checked out '20c0d886fb78d6497362e8303b999bf5d67aaa02'
Submodule path 'backends/xnnpack/third-party/cpuinfo': checked out 'd6860c477c99f1fce9e28eb206891af3c0e1a1d7'
Submodule path 'backends/xnnpack/third-party/pthreadpool': checked out '4fe0e1e183925bf8cfa6aae24237e724a96479b8'
Submodule path 'examples/models/llama2/third-party/abseil-cpp': checked out '854193071498f330b71083d7e06a7cd18e02a4cc'
Submodule path 'examples/models/llama2/third-party/re2': checked out 'ac82d4f628a2045d89964ae11c48403d3b091af1'
Submodule path 'examples/third-party/LLaVA': checked out '7440ec9ee37b0374c6b5548818e89878e38f3353'
Submodule path 'examples/third-party/fbjni': checked out '52a14f0daa889a20d8984798b8d96eb03cebd334'
Submodule path 'kernels/optimized/third-party/eigen': checked out 'a39ade4ccf99df845ec85c580fbbb324f71952fa'
Submodule path 'third-party/flatbuffers': checked out '0100f6a5779831fa7a651e4b67ef389a8752bd9b'
Submodule path 'third-party/flatcc': checked out 'eb5228f76d395bffe31a33398ff73e60dfba5914'
Submodule path 'third-party/gflags': checked out 'a738fdf9338412f83ab3f26f31ac11ed3f3ec4bd'
Submodule path 'third-party/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929'
Submodule path 'third-party/ios-cmake': checked out '06465b27698424cf4a04a5ca4904d50a3c966c45'
Submodule path 'third-party/prelude': checked out '4e9e6d50b8b461564a7e351ff60b87fe59d7e53b'
Submodule path 'third-party/pybind11': checked out '8c7b8dd0ae74b36b7d42f77b0dd4096ebb7f4ab1'
Submodule path 'backends/vulkan/third-party/VulkanMemoryAllocator': checked out 'a6bfc237255a6bac1513f7c1ebde6d8aed6b5191'
```
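For what it's worth, the failed submodule clones above can often be retried through a local HTTP proxy. A sketch assuming the same proxy that is exported for pip later in this session (127.0.0.1:8119); adjust the address to your setup:

```shell
# Assumption: a local HTTP proxy is listening on 127.0.0.1:8119
# (the same one used for pip in the session below).
export https_proxy="http://127.0.0.1:8119"
export http_proxy="http://127.0.0.1:8119"

# Re-run the submodule sync; already-cloned submodules are skipped.
cd ~/executorch
git submodule sync --recursive
git submodule update --init --recursive
```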
```
(executorch) weiguo:$ ./install_requirements.sh
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/nightly/cpu
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f72a2601810>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/torch/
export https_proxy="http://127.0.0.1:8119"
(executorch) weiguo:$ ./install_requirements.sh
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/nightly/cpu
Collecting torch==2.4.0.dev20240507
  Downloading https://download.pytorch.org/whl/nightly/cpu/torch-2.4.0.dev20240507%2Bcpu-cp310-cp310-linux_x86_64.whl (192.7 MB)
     ━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.5/192.7 MB 126.9 kB/s eta 0:24:28
  Downloading https://download.pytorch.org/whl/nightly/cpu/torch-2.4.0.dev20240507%2Bcpu-cp310-cp310-linux_x86_64.whl (192.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/192.7 MB 75.1 kB/s eta 0:42:21
```

This dependency is so big, and the network speed on my side is so slow/unstable, that I have to skip this step.

```
(executorch) weiguo:$ python -m examples.qualcomm.scripts.deeplab_v3 -b build_android -m SM8550 --compile_only --download
QNN_SDK_ROOT=/opt/qcom/aistack/qnn/2.20.0.240223
LD_LIBRARY_PATH=/opt/qcom/aistack/qnn/2.20.0.240223/lib/x86_64-linux-clang/:
Downloading http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar to ./deeplab_v3/voc_image/VOCtrainval_11-May-2012.tar
  0%|                                  | 1900544/1999639040 [00:11<3:20:52, 165749.57it/s]
```

I'll check out the ExecuTorch Android demo app after I fix that issue on my side, and I'll update this thread if I have any positive progress.

Thanks again.

chiwwang commented 3 months ago

@zhouwg you need to set the HTP precision custom configuration to fp16 to get reasonable floating-point performance. Your case looks like an fp32 addition, and HTP doesn't officially support it.

You can also see this trace:

```
[qnn_sdk_logcallback, 874]:      0.0ms [WARNING]  <W> Specified config SOC, ignoring on real target
```

QNN ignores the SOC configuration because the program runs on a real target, which QNN is able to detect.

That said, I'm not sure the ExecuTorch repository is the right place to ask QNN questions. You might want to check the Qualcomm QPM forum instead.
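For reference, the fp16 precision custom configuration mentioned above is set per graph via the HTP backend's custom graph config. A sketch assuming the QNN SDK headers (QnnGraph.h, QnnHtpGraph.h); the names follow the public QNN HTP API, but the wiring is illustrative:

```cpp
// Sketch only: requires the QNN SDK headers/libraries, not buildable standalone.
#include "QnnGraph.h"
#include "QnnHtpGraph.h"

static void create_fp16_graph(Qnn_ContextHandle_t ctx, Qnn_GraphHandle_t *graph) {
    // Per-graph HTP precision config: run float math in fp16 on the HTP.
    QnnHtpGraph_CustomConfig_t precision_cfg;
    precision_cfg.option    = QNN_HTP_GRAPH_CONFIG_OPTION_PRECISION;
    precision_cfg.precision = QNN_PRECISION_FLOAT16;

    QnnGraph_Config_t graph_cfg;
    graph_cfg.option       = QNN_GRAPH_CONFIG_OPTION_CUSTOM;
    graph_cfg.customConfig = &precision_cfg;

    const QnnGraph_Config_t *graph_cfgs[] = {&graph_cfg, nullptr};
    // In practice QnnGraph_create is resolved through the QNN interface table:
    // qnn_interface.graphCreate(ctx, "my_graph", graph_cfgs, graph);
    (void)graph_cfgs; (void)ctx; (void)graph;
}
```

The ExecuTorch Qualcomm backend applies a config of this kind when lowering fp32 models, which is one reason the delegate gets reasonable float performance out of the box.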

zhouwg commented 3 months ago

> @zhouwg you need to set the HTP precision custom configuration to fp16 to get reasonable floating-point performance. Your case looks like an fp32 addition, and HTP doesn't officially support it.
>
> You can also see this trace:
>
> ```
> [qnn_sdk_logcallback, 874]:      0.0ms [WARNING]  <W> Specified config SOC, ignoring on real target
> ```
>
> QNN ignores the SOC configuration because the program runs on a real target, which QNN is able to detect.
>
> That said, I'm not sure the ExecuTorch repository is the right place to ask QNN questions. You might want to check the Qualcomm QPM forum instead.

Thanks so much for helping me again.

I'm sorry about that, and I'll submit a ticket on Qualcomm's QPM forum next time.

chiwwang commented 3 months ago

Thanks. Or you can ping me in GGML pull requests (from the trace [ggml_qnn_add, 2229] I guess you're doing ggml things).

zhouwg commented 3 months ago

> Thanks. Or you can ping me in GGML pull requests (from the trace [ggml_qnn_add, 2229] I guess you're doing ggml things).

Thanks so much. I can't find that trace upstream. Can we discuss this problem in my personal learning & study project: https://github.com/zhouwg/kantv/tree/ggml-qnn-quantize/core/ggml/llamacpp/tests/ggml-qnn ?