Open · @165749 opened this issue 1 month ago
@cccclai any ideas here?
Hi @165749, thank you for trying the scripts.
Issue 1: The init time is expected to be slow, but I wonder why there are 50 delegates on your side. Is the graph partitioned into 50 shards?
Issue 2: This has been addressed in https://github.com/pytorch/executorch/pull/4164. The root cause is the generation of flatbuffer-compatible JSON, which dumps the delegated byte array in list format. If we encode/decode the byte array with base64 and store it as a string instead, performance is far better:
Taking the export of the first shard of llama2_7b as an example (16 cores / 64GB RAM / 128GB swap):
[original] time: ~90 min / memory: 160GB
[proposed] time: 40 s / memory: 8GB
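To illustrate the idea, here is a minimal sketch of the encoding trick, not the actual change in the pull request above; the payload and the field name are made up for the example:

import base64
import json
import os

# Hypothetical delegate payload: 4 MB of raw bytes (real shards are far larger).
payload = os.urandom(4 * 1024 * 1024)

# Original approach: every byte becomes its own JSON integer token.
as_list = json.dumps({"processed_bytes": list(payload)})

# Proposed approach: a single base64 string field, decoded back to bytes on load.
as_b64 = json.dumps({"processed_bytes": base64.b64encode(payload).decode("ascii")})

print(len(as_list), len(as_b64))  # the list form is several times larger to store and parse

# Round trip for the base64 variant.
assert base64.b64decode(json.loads(as_b64)["processed_bytes"]) == payload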
Hi @haowhsu-quic, thank you very much for your suggestions! I have delved further into both issues.
(1) Stories 110M
Correct, it seems that the model is partitioned into multiple shards. When I attempted to dump the .pte file, I noticed that a significant number of instructions are implemented by native kernels:
Instruction(
instr_args=KernelCall(
op_index=1,
args=[
9252(index=0),
480(index=1),
9254(index=2),
9257(index=3),
9260(index=4),
9263(index=5),
9264(index=6),
9267(index=7),
9268(index=8),
9253(index=9),
9253(index=10),
]
)
)(index=531),
and they correspond to aten::convolution (op_index=1 in the operator table):
operators=[
Operator(name=aten::index, overload=Tensor_out)(index=0),
Operator(name=aten::convolution, overload=out)(index=1),
],
Notably, this aligns with what I observed during compilation:
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to validate op aten_convolution_default_107 with error 0xc26
[WARNING] [Qnn ExecuTorch]: Qnn Backend op validation failed with error: 3110
[WARNING] [Qnn ExecuTorch]: QnnDsp <W> Received non-Static for tensor at index 4294967295.
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> QnnBackend_validateOpConfig failed 3110
...
[QNN Partitioner Op Support]: aten.convolution.default | False
I am not sure if this is the expected behavior.
(2) Llama-2-7b-chat-hf
Thanks for your advice! After applying #4164, I have successfully generated the .pte files. However, when running the model with the following command (as part of llama_qaihub.py):
cd /data/local/tmp/executorch/qaihub_llama7b && export ADSP_LIBRARY_PATH=. && export LD_LIBRARY_PATH=. && ./qnn_qaihub_llama_runner --sharded_1_path qaihub_llama7b_0.pte --sharded_2_path qaihub_llama7b_1.pte --sharded_3_path qaihub_llama7b_2.pte --sharded_4_path qaihub_llama7b_3.pte --freq_cos_path freq_cos.raw --freq_sin_path freq_sin.raw --output_path /data/local/tmp/executorch/qaihub_llama7b/outputs/result.txt --tokenizer_path tokenizer.bin --prompt 'What is Python?' --temperature 0.8 --seq_len 128 --eval_mode 1 --logits_scale 0.040317315608263016 --logits_offset 40626
I encountered an error related to establishing multiple connections to the QNN skel (SM8550 with 8GB RAM, QNN v2.24.0.240626):
[ERROR] [Qnn ExecuTorch]: <E> Cannot establish more than one connection to QNN skel: 1009
[ERROR] [Qnn ExecuTorch]: <E> Failed to create a new transport session for deviceId 0, coreId 0, pdId 2: err: 1009
[ERROR] [Qnn ExecuTorch]: <E> Error in creating transport session for deviceId 0, coreId 0, pdId 2, err: 1009
[ERROR] [Qnn ExecuTorch]: <E> Fail to create context from binary with err 1009
...
[ERROR] [Qnn ExecuTorch]: <E> Failed to create context from binary with err 0x3f1
[ERROR] [Qnn ExecuTorch]: Can't create context from binary. Error 1009.
E 00:00:15.988753 executorch:QnnManager.cpp:302] Fail to configure Qnn context
E 00:00:15.988882 executorch:QnnExecuTorchBackend.cpp:51] Fail to initialize Qnn Manager
E 00:00:15.989891 executorch:method.cpp:108] Init failed for backend QnnBackend: 0x1
Do you have any insights into this issue? I have attached the full QNN logs for your reference.
Hi @165749, thank you for the great analysis.
Hi @165749, sorry for the late reply.
I suspect 8GB of RAM might not be enough to load all shards; please perform the following steps before attempting:
# start from a fresh state
adb -s $DEVICE_SERIAL reboot
adb -s $DEVICE_SERIAL root
adb -s $DEVICE_SERIAL shell
# run the following on the device to disable the low memory killer
cd /sys/devices/system/memory
for i in $(ls | grep memory); do echo 0 > $i/online; done
for i in $(ls | grep memory); do echo online_kernel > $i/state; done
🐛 Describe the bug
I encountered issues while following the tutorials to run Llama2 on the Qualcomm HTP backend. I am using the latest code on an SM8550 (8GB RAM) with QNN v2.24.0.240626.
(1) Stories 110M: When running the script, it successfully generated llama2_qnn.pte (497MB), but the QnnManager::PreRegisterMem function takes an unusually long time. Specifically, the number of custom mem tensors in shared_buffer_manager.GetCustomMemTensorInfoSet() is 55,444 in my case, which, in my understanding, comes from the KV-cache tensors (set_all_shifted_ptrs), estimated as 144 (cache element size) × 128 (seq_len) × 3 (# of caches) = 55,296. However, the weird thing is that PreRegisterMem is invoked for every delegate initialization (n_delegate is 50 in my case), and the time spent in PreRegisterMem grows linearly from call to call (even though the number of custom mem tensors remains constant). Below are the times recorded for the first six runs of PreRegisterMem:
I am wondering whether the issue is due to an incorrect .pte model generated from examples/qualcomm/llama2/llama.py.
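As a sanity check, here is a minimal back-of-the-envelope sketch of those figures; the constants come from the numbers reported above, and the flat "re-register everything per delegate" cost model is only an assumption for illustration:

# Sanity check of the reported figures (constants taken from the report above;
# the per-delegate re-registration cost model is an assumption).
cache_elem = 144       # cache element size
seq_len = 128
num_caches = 3         # number of caches
estimated_kv_tensors = cache_elem * seq_len * num_caches
print(estimated_kv_tensors)             # 55296, close to the observed 55,444

n_delegate = 50
custom_mem_tensors = 55_444
# If every delegate init walks the full custom mem tensor set, the total work is
# n_delegate * custom_mem_tensors registrations, even though the set never changes.
print(n_delegate * custom_mem_tensors)  # 2,772,200 registrations in total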
(2) Llama-2-7b-chat-hf: When running the script, I got an out-of-memory error on a server with 128GB RAM when exporting the .pte file, even for the first shard. Specifically, the export command that the script runs in a subprocess requires more than 128GB of RAM with a data.json of size 4.7GB. I am curious whether it is common to require such a large amount of RAM for exporting the .pte files, or whether my configuration is incorrect.
Any insights would be greatly appreciated!
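For completeness, here is a minimal sketch of how the peak memory of that export step could be confirmed, assuming it is launched as a child process; the script name below is a placeholder, not the actual command:

import resource
import subprocess

# Placeholder for the actual export invocation used by the script.
subprocess.run(["python", "export_first_shard.py"], check=False)

# On Linux, ru_maxrss of finished children is reported in KiB.
peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"peak RSS of child processes: {peak_kib / 1024 / 1024:.1f} GiB")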
Versions
PyTorch version: 2.5.0.dev20240716+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04 LTS (x86_64)
GCC version: (conda-forge gcc 12.3.0-13) 12.3.0
Clang version: 18.1.3 (1ubuntu1)
CMake version: version 3.30.1
Libc version: glibc-2.39

Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-39-generic-x86_64-with-glibc2.39
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4080 SUPER
Nvidia driver version: 550.90.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] executorch==0.4.0a0+febd9c1
[pip3] numpy==1.21.3
[pip3] torch==2.5.0.dev20240716+cpu
[pip3] torchao==0.1
[pip3] torchaudio==2.4.0.dev20240716+cpu
[pip3] torchsr==1.0.4
[pip3] torchvision==0.20.0.dev20240716+cpu
[conda] executorch 0.4.0a0+febd9c1 pypi_0 pypi
[conda] numpy 1.21.3 pypi_0 pypi
[conda] torch 2.5.0.dev20240716+cpu pypi_0 pypi
[conda] torchao 0.1 pypi_0 pypi
[conda] torchaudio 2.4.0.dev20240716+cpu pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchvision 0.20.0.dev20240716+cpu pypi_0 pypi