quic / ai-hub-models

The Qualcomm® AI Hub Models are a collection of state-of-the-art machine learning models optimized for performance (latency, memory etc.) and ready to deploy on Qualcomm® devices.
https://aihub.qualcomm.com
BSD 3-Clause "New" or "Revised" License

QCT Genie SDK (genie-t2t-run) fails to run on QNN HTP backend #82

Closed taeyeonlee closed 2 days ago

taeyeonlee commented 1 month ago

Describe the bug
QCT Genie SDK (genie-t2t-run) fails to run the Llama2 7B model on the QNN HTP backend on my Samsung Galaxy S24 Ultra (Android). What does this error mean?

map::at: key not found
ERROR at line 244: Failed to create the dialog.

How can I run Llama2 on the QNN HTP backend?

To Reproduce
According to the tutorial (file:///C:/Qualcomm/AIStack/QAIRT/2.25.0.240728/docs/Genie/general/tutorials.html), to run on the QNN HTP backend, run the following.

adb shell mkdir -p /data/local/tmp/
adb push ${QNN_SDK_ROOT}/bin/aarch64-android/genie-t2t-run /data/local/tmp/
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libGenie.so /data/local/tmp/
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtp.so /data/local/tmp/
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnSystem.so /data/local/tmp/
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpPrepare.so /data/local/tmp/
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpNetRunExtensions.so /data/local/tmp/
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV75Stub.so /data/local/tmp/
adb push ${QNN_SDK_ROOT}/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so /data/local/tmp/
adb push htp_backend_ext_config.json /data/local/tmp/
adb push llama2-7b-htp.json /data/local/tmp/
adb push tokenizer.json /data/local/tmp/
adb push llama_v2_7b_chat_quantized_PromptProcessor_1_Quantized.bin /data/local/tmp/
adb push llama_v2_7b_chat_quantized_PromptProcessor_2_Quantized.bin /data/local/tmp/
adb push llama_v2_7b_chat_quantized_PromptProcessor_3_Quantized.bin /data/local/tmp/
adb push llama_v2_7b_chat_quantized_PromptProcessor_4_Quantized.bin /data/local/tmp/
adb push llama_v2_7b_chat_quantized_TokenGenerator_1_Quantized.bin /data/local/tmp/
adb push llama_v2_7b_chat_quantized_TokenGenerator_2_Quantized.bin /data/local/tmp/
adb push llama_v2_7b_chat_quantized_TokenGenerator_3_Quantized.bin /data/local/tmp/
adb push llama_v2_7b_chat_quantized_TokenGenerator_4_Quantized.bin /data/local/tmp/
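The eight per-binary push commands above can be collapsed into a loop. A minimal sketch (the push() helper and DRY_RUN flag are my additions, not part of the tutorial; it covers only the context-binary files):

```shell
#!/bin/sh
# Push the split context binaries in a loop instead of one command per file.
# DRY_RUN=1 prints the commands; unset it to actually run adb (assumed on PATH).
DRY_RUN=1
DEST=/data/local/tmp/

push() {
  if [ -n "$DRY_RUN" ]; then
    echo adb push "$1" "$DEST"
  else
    adb push "$1" "$DEST"
  fi
}

for part in 1 2 3 4; do
  push "llama_v2_7b_chat_quantized_PromptProcessor_${part}_Quantized.bin"
  push "llama_v2_7b_chat_quantized_TokenGenerator_${part}_Quantized.bin"
done
```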

(qct_python310_VENV_root) taeyeon@taeyeon-Desktop-PC:~/QCT_GENIE$ adb shell
e3q:/ $ export LD_LIBRARY_PATH=/data/local/tmp/
e3q:/ $ export PATH=$LD_LIBRARY_PATH:$PATH
e3q:/ $ cd $LD_LIBRARY_PATH
e3q:/data/local/tmp $ ./genie-t2t-run -c /data/local/tmp/llama2-7b-htp.json -p "What is the most popular cookie in the world?"
Using libGenie.so version 1.0.0

[WARN]  "Unable to initialize logging in backend extensions."
[INFO]  "Allocated total size = 220985856 across 4 buffers"
map::at: key not found
ERROR at line 244: Failed to create the dialog.
1|e3q:/data/local/tmp $

llama2-7b-htp.json
================================================================
{
  "dialog": {
    "version": 1,
    "type": "basic",
    "context": {
      "version": 1,
      "size": 1024,
      "n-vocab": 32000,
      "bos-token": 1,
      "eos-token": 2
    },
    "sampler": {
      "version": 1,
      "seed": 42,
      "temp": 0.8,
      "top-k": 40,
      "top-p": 0.95,
      "greedy": true
    },
    "tokenizer": {
      "version": 1,
      "path": "/data/local/tmp/tokenizer.json"
    },
    "engine": {
      "version": 1,
      "n-threads": 3,
      "backend": {
        "version": 1,
        "type": "QnnHtp",
        "QnnHtp": {
          "version": 1,
          "spill-fill-bufsize": 320000000,
          "use-mmap": true,
          "mmap-budget": 0,
          "poll": true,
          "pos-id-dim": 64,
          "cpu-mask": "0xe0",
          "kv-dim": 128
        },
        "extensions": "htp_backend_ext_config.json"
      },
      "model": {
        "version": 1,
        "type": "binary",
        "binary": {
          "version": 1,
          "ctx-bins": [
            "llama_v2_7b_chat_quantized_PromptProcessor_1_Quantized.bin",
            "llama_v2_7b_chat_quantized_PromptProcessor_2_Quantized.bin",
            "llama_v2_7b_chat_quantized_PromptProcessor_3_Quantized.bin",
            "llama_v2_7b_chat_quantized_PromptProcessor_4_Quantized.bin",
            "llama_v2_7b_chat_quantized_TokenGenerator_1_Quantized.bin",
            "llama_v2_7b_chat_quantized_TokenGenerator_2_Quantized.bin",
            "llama_v2_7b_chat_quantized_TokenGenerator_3_Quantized.bin",
            "llama_v2_7b_chat_quantized_TokenGenerator_4_Quantized.bin"
          ]
        }
      }
    }
  }
}
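Malformed JSON in this config can surface on-device as an opaque runtime failure rather than a parse error, so it is worth validating on the host before pushing. A minimal sketch, assuming python3 is available on the host (the check_cfg helper is my addition, not a Genie SDK tool):

```shell
#!/bin/sh
# Sanity-check a Genie config on the host before pushing it to the device.
check_cfg() {
  # Fail fast on malformed JSON.
  python3 -m json.tool "$1" > /dev/null || return 1
  # Print the context binaries the engine will try to load, one per line,
  # so they can be compared against what was actually pushed.
  python3 -c 'import json,sys; print("\n".join(json.load(open(sys.argv[1]))["dialog"]["engine"]["model"]["binary"]["ctx-bins"]))' "$1"
}

# Run against the config from this thread, if present in the working directory.
[ -f llama2-7b-htp.json ] && check_cfg llama2-7b-htp.json || echo "llama2-7b-htp.json not found; skipping"
```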

yolanda1224git commented 1 month ago
nepro012 commented 1 month ago

Same here, no luck. I also tried these 8 models. I'm stuck on Windows on ARM with an X-Elite: NPU usage goes up, then it fails. I'm guessing there must be another method of generating the 4-way split model specifically for genie-t2t.

bhushan23 commented 1 month ago

Thank you so much for providing this feedback. We are aware of this issue and are actively working on it; we will update the Llama variants in an upcoming release so they integrate and work with Genie seamlessly.

Please stay tuned for further updates. We will keep this issue open and share updates here as well.

bhushan23 commented 2 days ago

Hi, please refer to https://github.com/quic/ai-hub-models/tree/main/qai_hub_models/models/llama_v2_7b_chat_quantized/gen_ondevice_llama to run Llama2 models on device with Genie.

We will soon release a C++ app using the Genie C++ APIs.