quic / ai-hub-models

The Qualcomm® AI Hub Models are a collection of state-of-the-art machine learning models optimized for performance (latency, memory, etc.) and ready to deploy on Qualcomm® devices.
https://aihub.qualcomm.com
BSD 3-Clause "New" or "Revised" License

[BUG] Failure on building Llama_v2_7b_chat_quantized #129

Open sparkleholic opened 3 days ago

sparkleholic commented 3 days ago

I've tried many times to build the HTP model for Llama_v2_7b_chat_quantized. The llama_v2_7b_chat_quantized_TokenGenerator steps succeeded, but all llama_v2_7b_chat_quantized_PromptProcessor steps failed. The suspect part is the following; this is a snippet of the compile log from the AI Hub site.

[2024-11-22 10:15:14,396] [INFO] Running /qnn_sdk/bin/x86_64-linux-clang/qnn-context-binary-generator --backend /qnn_sdk/lib/x86_64-linux-clang/libQnnHtp.so --model /tmp/93bb585a-841f-4661-abcb-06d4b24099fecbo5z4qa/tmpwub1ws61.so --output_dir /tmp/93bb585a-841f-4661-abcb-06d4b24099fecbo5z4qa/tmp4azwf6jc --binary_file qnn_model --config_file /tmp/93bb585a-841f-4661-abcb-06d4b24099fecbo5z4qa/tmp4azwf6jc/htp_context.json

[2024-11-22 10:21:55,069] [INFO] qnn-context-binary-generator pid:499
0.0ms [ ERROR ] **fa_alloc.cc:3866:ERROR:graph requires estimated allocation of 2347229 KB, limit is 2097152 KB**
0.0ms [ ERROR ] graph_prepare.cc:742:ERROR:error during serialize: memory usage too large
0.0ms [ ERROR ] graph_prepare.cc:6095:ERROR:Serialize error: memory usage too large
0.0ms [ ERROR ] QnnDsp <E> Graph prompt_part4 serialization failed
0.0ms [ ERROR ] QnnDsp <E> Failed to serialize graph prompt_part4
0.0ms [ ERROR ] QnnDsp <E> Context binary serialization failed
0.0ms [ ERROR ] QnnDsp <E> Get context blob failed.
0.0ms [ ERROR ] QnnDsp <E> Failed to get serialized binary
0.0ms [ ERROR ] QnnDsp <E> Failed to get context binary with err 0x138f
399119.1ms [ ERROR ] Could not get binary.
Graph Finalize failure
[2024-11-22 10:21:55,239] [ERROR] Conversion to context binary failed with exit code 15

On my local machine there is ample memory and no limitation like the one described in the log above. I wonder whether this failure comes from the AI Hub cloud resources or not.
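For context, the limit in the log does not look like host RAM: 2097152 KB is exactly 2 GiB, which reads like a fixed ceiling in the HTP context-binary serializer rather than a machine-dependent amount of free memory (this is an inference from the `fa_alloc.cc` error text, not something the log confirms). A quick check of the numbers:

```python
# Figures taken directly from the fa_alloc.cc error line above.
required_kb = 2_347_229  # estimated allocation the graph needs
limit_kb = 2_097_152     # reported limit

# The limit is exactly 2 GiB, suggesting a fixed serializer ceiling.
print(limit_kb * 1024 == 2**31)  # True

# prompt_part4 overshoots that ceiling by roughly 244 MiB.
overage_mib = (required_kb - limit_kb) / 1024
print(f"{overage_mib:.1f} MiB over the limit")
```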

To Reproduce

$ python3 -m qai_hub_models.models.llama_v2_7b_chat_quantized.export --device "QCS8550 (Proxy)" --skip-inferencing --skip-profiling --skip-downloading --output-dir genie_bundle
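For anyone scripting the repro, the same invocation can be driven from Python with `subprocess` (a convenience sketch wrapping the command above; it assumes `qai_hub_models` is installed and an AI Hub API token is configured, so the actual run is left commented out):

```python
import shlex
import subprocess
import sys

# Same command as above, expressed as an argv list.
cmd = [
    sys.executable, "-m",
    "qai_hub_models.models.llama_v2_7b_chat_quantized.export",
    "--device", "QCS8550 (Proxy)",
    "--skip-inferencing",
    "--skip-profiling",
    "--skip-downloading",
    "--output-dir", "genie_bundle",
]
print(shlex.join(cmd))

# Uncomment to actually submit the compile jobs
# (requires qai_hub_models and a configured AI Hub token):
# subprocess.run(cmd, check=True)
```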

Expected behavior: Success


mestrona-3 commented 2 days ago

Hi @sparkleholic, thanks for filing an issue! We've received a few inquiries about llama_v2 on the QCS8550 device. We're working on a fix and will share it as soon as it's available. I'd encourage you to join our Slack Community to hear when it has been released!