pytorch / executorch

On-device AI across mobile, embedded and edge for PyTorch
https://pytorch.org/executorch/

Errors when running Llama on QNN backend #4517

Open 165749 opened 1 month ago

165749 commented 1 month ago

🐛 Describe the bug

I encountered issues while following the tutorials to run Llama2 on the Qualcomm HTP backend. I am using the latest code on SM8550 (8GB RAM) with QNN v2.24.0.240626.

(1) Stories 110M: When running the script

python examples/qualcomm/llama2/llama.py -a ${ARTIFACTS} -b build_android -s ${SERIAL_NUM} -m ${SOC_MODEL} --ptq 16a4w --checkpoint stories110M.pt --params params.json --tokenizer_model tokenizer.model --tokenizer_bin tokenizer.bin --prompt "Once"

, it successfully generated llama2_qnn.pte (497MB), but the QnnManager::PreRegisterMem function takes an unusually long time. Specifically, the number of custom mem tensors in shared_buffer_manager.GetCustomMemTensorInfoSet() is 55,444 in my case, which, in my understanding, is due to the tensors from the KV cache (set_all_shifted_ptrs), estimated as 144 (cache element size) * 128 (seq_len) * 3 (# of caches) = 55,296. The strange part is that PreRegisterMem is invoked for every delegate initialization (n_delegate is 50 in my case), and its runtime grows linearly across invocations, even though the number of custom mem tensors stays constant. Below are the times recorded for the first six runs of PreRegisterMem:

Run 0: 16278838 us
Run 1: 66038691 us
Run 2: 128975195 us
Run 3: 183058687 us
Run 4: 238663807 us
Run 5: 294736509 us
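Extrapolating from the runs above (a rough sketch, assuming the per-run increment stays roughly constant across all 50 delegates):

```python
# Rough projection of total PreRegisterMem time, assuming the linear growth
# seen in the first six runs continues for all delegates.
times_us = [16278838, 66038691, 128975195, 183058687, 238663807, 294736509]
increments = [b - a for a, b in zip(times_us, times_us[1:])]
avg_step_us = sum(increments) / len(increments)  # ~55.7 s added per run

n_delegate = 50
# Sum of an arithmetic series: total init time grows quadratically in the
# number of delegates.
projected_total_s = sum(times_us[0] + i * avg_step_us for i in range(n_delegate)) / 1e6
print(f"projected total PreRegisterMem time: ~{projected_total_s / 60:.0f} minutes")
```

Under that assumption, the 50 delegates would spend on the order of 19 hours in PreRegisterMem alone, which underlines that something is off.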

I am wondering whether the issue is due to an incorrect .pte model generated from examples/qualcomm/llama2/llama.py.
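The tensor-count estimate above can be sanity-checked directly (constants are the ones reported in this issue):

```python
# Sanity check of the custom mem tensor count (numbers from this report).
cache_element_size = 144
seq_len = 128
num_caches = 3  # "# of caches" as estimated above

estimated = cache_element_size * seq_len * num_caches
observed = 55_444  # size of shared_buffer_manager.GetCustomMemTensorInfoSet()
print(estimated, observed - estimated)  # 55296 148
```

So the KV-cache tensors account for all but 148 of the observed entries, consistent with the reading above.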

(2) Llama-2-7b-chat-hf: When running the script

python examples/qualcomm/llama2/llama_qaihub.py -a ${ARTIFACTS} -b build_android -s ${SERIAL_NUM} -m ${SOC_MODEL} --context_binaries ${AIHUB_CONTEXT_BINARIES} --tokenizer_bin tokenizer.bin --prompt "What is Python?"

, I got an out-of-memory error on a server with 128GB RAM when exporting the .pte file, even for the first shard. Specifically, running the following command in a subprocess

flatc --binary -o /tmp/tmpqccknutm /tmp/tmpqccknutm/program.fbs /tmp/tmpqccknutm/data.json

requires more than 128GB RAM with a data.json of size 4.7GB. I am curious whether it is common to require this much RAM when exporting the .pte files, or whether my configuration is incorrect.
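To confirm that the flatc child process itself is where the memory goes, one generic option (a sketch, not ExecuTorch-specific; the temporary paths are elided as above) is to measure the child's peak RSS from Python:

```python
# Generic sketch: run a command to completion and report the peak resident
# set size of child processes (ru_maxrss is in KiB on Linux).
import resource
import subprocess

def peak_child_rss_mib(cmd):
    """Run cmd and return the cumulative peak child RSS in MiB."""
    subprocess.run(cmd, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss / 1024

# Replace ["true"] with the actual invocation, e.g.
# ["flatc", "--binary", "-o", out_dir, "program.fbs", "data.json"]
print(f"peak child RSS: {peak_child_rss_mib(['true']):.1f} MiB")
```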

Any insights would be greatly appreciated!

Versions

PyTorch version: 2.5.0.dev20240716+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04 LTS (x86_64)
GCC version: (conda-forge gcc 12.3.0-13) 12.3.0
Clang version: 18.1.3 (1ubuntu1)
CMake version: version 3.30.1
Libc version: glibc-2.39

Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-39-generic-x86_64-with-glibc2.39
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4080 SUPER
Nvidia driver version: 550.90.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] executorch==0.4.0a0+febd9c1
[pip3] numpy==1.21.3
[pip3] torch==2.5.0.dev20240716+cpu
[pip3] torchao==0.1
[pip3] torchaudio==2.4.0.dev20240716+cpu
[pip3] torchsr==1.0.4
[pip3] torchvision==0.20.0.dev20240716+cpu
[conda] executorch 0.4.0a0+febd9c1 pypi_0 pypi
[conda] numpy 1.21.3 pypi_0 pypi
[conda] torch 2.5.0.dev20240716+cpu pypi_0 pypi
[conda] torchao 0.1 pypi_0 pypi
[conda] torchaudio 2.4.0.dev20240716+cpu pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchvision 0.20.0.dev20240716+cpu pypi_0 pypi

mcr229 commented 1 month ago

@cccclai any ideas here?

haowhsu-quic commented 1 month ago

Hi @165749, thank you for trying the scripts.

165749 commented 1 month ago

Hi @haowhsu-quic , thank you very much for your suggestions! I have delved further into both issues.

(1) Stories 110M

Correct, it seems that the model is partitioned into multiple shards. When I attempted to dump the .pte file, I noticed that a significant number of instructions are implemented by native kernels:

            Instruction(
              instr_args=KernelCall(
                op_index=1, 
                args=[
                  9252(index=0),
                  480(index=1),
                  9254(index=2),
                  9257(index=3),
                  9260(index=4),
                  9263(index=5),
                  9264(index=6),
                  9267(index=7),
                  9268(index=8),
                  9253(index=9),
                  9253(index=10),
                ]
              )
            )(index=531),

and they correspond to aten::convolution

      operators=[
        Operator(name=aten::index, overload=Tensor_out)(index=0),
        Operator(name=aten::convolution, overload=out)(index=1),
      ], 

Notably, this is aligned with my observation during compilation:

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to validate op aten_convolution_default_107 with error 0xc26

[WARNING] [Qnn ExecuTorch]: Qnn Backend op validation failed with error: 3110
[WARNING] [Qnn ExecuTorch]: QnnDsp <W> Received non-Static for tensor at index 4294967295.

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> QnnBackend_validateOpConfig failed 3110

...

[QNN Partitioner Op Support]: aten.convolution.default | False

I am not sure if this is the expected behavior.
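One way to triage the fallbacks systematically (a hedged sketch that scans the textual dump excerpted above; the dump format comes from my script output, not a stable API):

```python
# Count the operator names that appear in the non-delegated `operators=[...]`
# section of the textual .pte dump (format as excerpted above).
import re
from collections import Counter

dump_excerpt = """
operators=[
  Operator(name=aten::index, overload=Tensor_out)(index=0),
  Operator(name=aten::convolution, overload=out)(index=1),
],
"""
fallback_ops = Counter(re.findall(r"Operator\(name=([\w:.]+)", dump_excerpt))
print(fallback_ops)  # each op name with its occurrence count
```

Running this over the full dump would show whether aten::convolution is the only op being rejected by the partitioner or one of several.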

(2) Llama-2-7b-chat-hf

Thanks for your advice! After applying #4164, I have successfully generated the .pte files. However, when running the model with the following command (as part of llama_qaihub.py)

cd /data/local/tmp/executorch/qaihub_llama7b && export ADSP_LIBRARY_PATH=. && export LD_LIBRARY_PATH=. && ./qnn_qaihub_llama_runner --sharded_1_path qaihub_llama7b_0.pte --sharded_2_path qaihub_llama7b_1.pte --sharded_3_path qaihub_llama7b_2.pte --sharded_4_path qaihub_llama7b_3.pte --freq_cos_path freq_cos.raw --freq_sin_path freq_sin.raw --output_path /data/local/tmp/executorch/qaihub_llama7b/outputs/result.txt --tokenizer_path tokenizer.bin --prompt 'What is Python?' --temperature 0.8 --seq_len 128 --eval_mode 1 --logits_scale 0.040317315608263016 --logits_offset 40626

, I encountered an error related to establishing multiple connections to QNN skel (SM8550 with 8GB RAM, QNN v2.24.0.240626):

[ERROR] [Qnn ExecuTorch]:  <E> Cannot establish more than one connection to QNN skel: 1009

[ERROR] [Qnn ExecuTorch]:  <E> Failed to create a new transport session for deviceId 0, coreId 0, pdId 2: err: 1009

[ERROR] [Qnn ExecuTorch]:  <E> Error in creating transport session for deviceId 0, coreId 0, pdId 2, err: 1009

[ERROR] [Qnn ExecuTorch]:  <E> Fail to create context from binary with err 1009

...

[ERROR] [Qnn ExecuTorch]:  <E> Failed to create context from binary with err 0x3f1

[ERROR] [Qnn ExecuTorch]: Can't create context from binary. Error 1009.
E 00:00:15.988753 executorch:QnnManager.cpp:302] Fail to configure Qnn context
E 00:00:15.988882 executorch:QnnExecuTorchBackend.cpp:51] Fail to initialize Qnn Manager
E 00:00:15.989891 executorch:method.cpp:108] Init failed for backend QnnBackend: 0x1

Do you have any insights into this issue? I have attached the full QNN logs for your reference.

Click to expand the full log ``` [INFO] [Qnn ExecuTorch]: create QNN Logger with log_level 3 [INFO] [Qnn ExecuTorch]: QnnLog_create started. [WARNING] [Qnn ExecuTorch]: Initializing HtpProvider [INFO] [Qnn ExecuTorch]: exit with 0 [INFO] [Qnn ExecuTorch]: exit with 0 [INFO] [Qnn ExecuTorch]: First connection to QNN stub established! [INFO] [Qnn ExecuTorch]: Applying log level 3 to 0 devices [WARNING] [Qnn ExecuTorch]: Function not called, PrepareLib isn't loaded! [INFO] [Qnn ExecuTorch]: QnnLog_create exit. [INFO] [Qnn ExecuTorch]: Initialize Qnn backend parameters for Qnn executorch backend type 2 [INFO] [Qnn ExecuTorch]: Caching: Caching is in RESTORE MODE. [INFO] [Qnn ExecuTorch]: QnnBackend_create started. backend = 0x7c40b488 [INFO] [Qnn ExecuTorch]: QnnBackend_create done successfully. backend = 0x7c40b488 [INFO] [Qnn ExecuTorch]: QnnDevice_create started [WARNING] [Qnn ExecuTorch]: sg_stubPtr is not null, skip loadRemoteSymbols [INFO] [Qnn ExecuTorch]: exits with 2, successfully initialized rpc memory [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 8 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 136 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 8 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 8 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 56 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [WARNING] [Qnn ExecuTorch]: sg_stubPtr is not null, skip loadRemoteSymbols [WARNING] [Qnn ExecuTorch]: Function not called, PrepareLib isn't loaded! [INFO] [Qnn ExecuTorch]: QnnDevice_create done. device = 0x1. status 0x0 [INFO] [Qnn ExecuTorch]: QnnDevice_getInfrastructure started [INFO] [Qnn ExecuTorch]: qnnHtpPerfInfrastructureGet started. 
[INFO] [Qnn ExecuTorch]: qnnHtpPerfInfrastructureGet done. status 0x0 [INFO] [Qnn ExecuTorch]: QnnDevice_getInfrastructure done. status 0x0 [INFO] [Qnn ExecuTorch]: htpPerfInfrastructureCreatePowerConfigId started for deviceId: 0, coreId: 0 [INFO] [Qnn ExecuTorch]: htpPerfInfrastructureCreatePowerConfigId done. status 0x0 [INFO] [Qnn ExecuTorch]: htpPerfInfrastructureSetPowerConfig started for powerConfigId: 1 [INFO] [Qnn ExecuTorch]: Graph Stats ptr not added for device 0 core 0 processDomain 0 [INFO] [Qnn ExecuTorch]: htpPerfInfrastructureSetPowerConfig done. status 0x0 [INFO] [Qnn ExecuTorch]: htpPerfInfrastructureSetPowerConfig started for powerConfigId: 1 [INFO] [Qnn ExecuTorch]: htpPerfInfrastructureSetPowerConfig done. status 0x0 [INFO] [Qnn ExecuTorch]: QnnContext_createFromBinary started. backend = 0x1, device = 0x1 [INFO] [Qnn ExecuTorch]: Context Blob version: 3.1.2 [WARNING] [Qnn ExecuTorch]: sg_stubPtr is not null, skip loadRemoteSymbols [WARNING] [Qnn ExecuTorch]: Failed to create new env handle with error 1100 [INFO] [Qnn ExecuTorch]: updatePerfInfraConfigs started for deviceId: 0, coreId: 0 [INFO] [Qnn ExecuTorch]: Graph Stats ptr not added for device 0 core 0 processDomain 0 [INFO] [Qnn ExecuTorch]: updatePerfInfraConfigs done. status 0x0 [WARNING] [Qnn ExecuTorch]: sg_stubPtr is not null, skip loadRemoteSymbols [WARNING] [Qnn ExecuTorch]: Function not called, PrepareLib isn't loaded! 
[INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 67091144 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 118880 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 136376 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: RPC Memory allocated for graph tmpy9me94uk with In 0x7b1897b000 [3ffbac8 B], Inouts 0x7c7c8c3000 [214b8 B], Outs 0x7c7cb25000 [1d060 B] [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 28 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 1083179712 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 1080033280 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: Context Blob version: 3.1.2 [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 3557696 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 40 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [WARNING] [Qnn ExecuTorch]: This option (2) is only for offline prepare case. [INFO] [Qnn ExecuTorch]: graph tmpy9me94uk is loaded 1 [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 48 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: QnnContext_createFromBinary done successfully. context = 0x1 [INFO] [Qnn ExecuTorch]: Running level=3 optimization. [INFO] [Qnn ExecuTorch]: QnnGraph_retrieve started. context = 0x1 [INFO] [Qnn ExecuTorch]: QnnGraph_retrieve done. status 0x0 [INFO] [Qnn ExecuTorch]: create QNN Logger with log_level 3 [INFO] [Qnn ExecuTorch]: QnnLog_create started. [INFO] [Qnn ExecuTorch]: Applying log level 3 to 1 devices [INFO] [Qnn ExecuTorch]: QnnLog_create exit. 
[INFO] [Qnn ExecuTorch]: Initialize Qnn backend parameters for Qnn executorch backend type 2 [INFO] [Qnn ExecuTorch]: Caching: Caching is in RESTORE MODE. [INFO] [Qnn ExecuTorch]: QnnBackend_create started. backend = 0xf095b4e8 [INFO] [Qnn ExecuTorch]: QnnBackend_create done successfully. backend = 0xf095b4e8 [INFO] [Qnn ExecuTorch]: QnnDevice_create started [INFO] [Qnn ExecuTorch]: QnnDevice_create done. device = 0x2. status 0x0 [INFO] [Qnn ExecuTorch]: QnnDevice_getInfrastructure started [INFO] [Qnn ExecuTorch]: QnnDevice_getInfrastructure done. status 0x0 [INFO] [Qnn ExecuTorch]: htpPerfInfrastructureCreatePowerConfigId started for deviceId: 0, coreId: 0 [INFO] [Qnn ExecuTorch]: htpPerfInfrastructureCreatePowerConfigId done. status 0x0 [INFO] [Qnn ExecuTorch]: htpPerfInfrastructureSetPowerConfig started for powerConfigId: 2 [INFO] [Qnn ExecuTorch]: htpPerfInfrastructureSetPowerConfig done. status 0x0 [INFO] [Qnn ExecuTorch]: htpPerfInfrastructureSetPowerConfig started for powerConfigId: 2 [INFO] [Qnn ExecuTorch]: htpPerfInfrastructureSetPowerConfig done. status 0x0 [INFO] [Qnn ExecuTorch]: QnnContext_createFromBinary started. backend = 0x2, device = 0x2 [INFO] [Qnn ExecuTorch]: Context Blob version: 3.1.2 [WARNING] [Qnn ExecuTorch]: sg_stubPtr is not null, skip loadRemoteSymbols [WARNING] [Qnn ExecuTorch]: Failed to create new env handle with error 1100 [INFO] [Qnn ExecuTorch]: updatePerfInfraConfigs started for deviceId: 0, coreId: 0 [INFO] [Qnn ExecuTorch]: updatePerfInfraConfigs done. status 0x0 [WARNING] [Qnn ExecuTorch]: sg_stubPtr is not null, skip loadRemoteSymbols [WARNING] [Qnn ExecuTorch]: Function not called, PrepareLib isn't loaded! 
[INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 67099328 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 118880 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 136360 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: RPC Memory allocated for graph tmptznss4bs with In 0x79df802000 [3ffdac0 B], Inouts 0x7c7c8a1000 [214a8 B], Outs 0x7c7caf0000 [1d060 B] [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 28 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 817889280 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [ERROR] [Qnn ExecuTorch]: fastrpc memory map for fd: 33 with length: 817889280 failed with error: 0x1 [ERROR] [Qnn ExecuTorch]: Failed to map weights buffer to device! [ERROR] [Qnn ExecuTorch]: Could not allocate persistent weights buffer! [ERROR] [Qnn ExecuTorch]: Failed to initialize graph memory [ERROR] [Qnn ExecuTorch]: Failed to initialize graph with id 256 context 2 deviceId 0 coreId 0 pdId 0 with err 1002 [ERROR] [Qnn ExecuTorch]: Context create from binary failed for deviceId 0 coreId 0 pdId 0 err 1002 [INFO] [Qnn ExecuTorch]: Graph Rpc memory was not initialized [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 48 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 40 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [ERROR] [Qnn ExecuTorch]: Transport session for deviceId 0 coreId 0 pdId 2 not found! [ERROR] [Qnn ExecuTorch]: Transport session for deviceId 0 coreId 0 pdId 2 not found! 
[WARNING] [Qnn ExecuTorch]: sg_stubPtr is not null, skip loadRemoteSymbols [INFO] [Qnn ExecuTorch]: exits with 2, successfully initialized rpc memory [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 8 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 136 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 8 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 8 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [ERROR] [Qnn ExecuTorch]: Cannot establish more than one connection to QNN skel: 1009 [ERROR] [Qnn ExecuTorch]: Failed to create a new transport session for deviceId 0, coreId 0, pdId 2: err: 1009 [ERROR] [Qnn ExecuTorch]: Error in creating transport session for deviceId 0, coreId 0, pdId 2, err: 1009 [ERROR] [Qnn ExecuTorch]: Fail to create context from binary with err 1009 [INFO] [Qnn ExecuTorch]: QnnContext_free started. context = 0x2 [INFO] [Qnn ExecuTorch]: Graph Rpc memory was not initialized [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 48 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 40 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [WARNING] [Qnn ExecuTorch]: sg_stubPtr is not null, skip loadRemoteSymbols [INFO] [Qnn ExecuTorch]: QnnContext_free done successfully. [ERROR] [Qnn ExecuTorch]: Failed to create context from binary with err 0x3f1 [ERROR] [Qnn ExecuTorch]: Can't create context from binary. Error 1009. 
E 00:00:15.235067 executorch:QnnManager.cpp:302] Fail to configure Qnn context E 00:00:15.235135 executorch:QnnExecuTorchBackend.cpp:51] Fail to initialize Qnn Manager E 00:00:15.235524 executorch:method.cpp:108] Init failed for backend QnnBackend: 0x1 [INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters [INFO] [Qnn ExecuTorch]: Destroy Qnn context [INFO] [Qnn ExecuTorch]: QnnContext_free started. context = 0x1 [INFO] [Qnn ExecuTorch]: Graph Rpc memory was not initialized [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 48 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 40 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [WARNING] [Qnn ExecuTorch]: sg_stubPtr is not null, skip loadRemoteSymbols [INFO] [Qnn ExecuTorch]: QnnContext_free done successfully. [INFO] [Qnn ExecuTorch]: htpPerfInfrastructureSetPowerConfig started for powerConfigId: 1 [INFO] [Qnn ExecuTorch]: htpPerfInfrastructureSetPowerConfig done. status 0x0 [INFO] [Qnn ExecuTorch]: htpPerfInfrastructureDestroyPowerConfigId started for powerConfigId: 1 [INFO] [Qnn ExecuTorch]: htpPerfInfrastructureDestroyPowerConfigId done. status 0x0 [INFO] [Qnn ExecuTorch]: Destroy Qnn device [INFO] [Qnn ExecuTorch]: QnnDevice_free started. device = 0x1 [INFO] [Qnn ExecuTorch]: QnnDevice_free done. status 0x0 [INFO] [Qnn ExecuTorch]: Destroy Qnn backend [INFO] [Qnn ExecuTorch]: QnnBackend_free started. backend = 0x1 [INFO] [Qnn ExecuTorch]: QnnLog_free started. [INFO] [Qnn ExecuTorch]: QnnLog_free exit. [WARNING] [Qnn ExecuTorch]: Backend 2 free cleanup called during process exit [INFO] [Qnn ExecuTorch]: rpcMemoryAlloc 8 isInit 1 [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [WARNING] [Qnn ExecuTorch]: qnnOpPackageManager: hexagon unload op package function pointer is nullptr! 
[WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support [WARNING] [Qnn ExecuTorch]: sg_stubPtr is not null, skip loadRemoteSymbols [WARNING] [Qnn ExecuTorch]: This META does not have Alloc2 Support ```
haowhsu-quic commented 1 month ago

Hi @165749, thank you for the great analysis.

  • Issue 1 was introduced by a recent QNN version (>= 2.24). You could fall back to 2.23 or apply the patch from the ongoing https://github.com/pytorch/executorch/pull/4560. llama2.py is expected to be fully delegated.
  • Issue 2 might need some time to investigate; I'll find a SM8550 device and get back to you ASAP.
haowhsu-quic commented 1 month ago

Hi @165749, sorry for the late reply.

  • Issue 1: You can expect a fully delegated stories llama on the main branch now.
  • Issue 2: Please revert the changes related to https://github.com/pytorch/executorch/pull/4164; I believe the memory footprint and lowering speed are quite efficient now.
    I also found it works well on a SM8550 with 11GB RAM.

I suspect 8GB RAM might not be enough to load all shards; please perform the following steps before trying again:

```
# conducting with a fresh status
adb -s $DEVICE_SERIAL reboot
adb -s $DEVICE_SERIAL root
adb -s $DEVICE_SERIAL shell
# type the following on device to disable the low memory killer
cd /sys/devices/system/memory
for i in $(ls | grep memory); do echo 0 > $i/online; done
for i in $(ls | grep memory); do echo online_kernel > $i/state; done
```
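Before relaunching the runner, it is worth confirming how much memory is actually free. The value lives in the standard Linux procfs (also present on Android, so `cat /proc/meminfo` inside `adb shell` works); here is a small host-side Python sketch that parses it:

```python
# Parse MemAvailable from /proc/meminfo (the kernel reports the value in KiB).
def mem_available_mib(meminfo_text):
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / 1024
    return None

with open("/proc/meminfo") as f:
    print(f"MemAvailable: {mem_available_mib(f.read()):.0f} MiB")
```

If the available memory after a fresh reboot is still well below the combined size of the four shards, the fastrpc mmap failure seen in the log above is the expected outcome.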