quic / ai-hub-models

The Qualcomm® AI Hub Models are a collection of state-of-the-art machine learning models optimized for performance (latency, memory etc.) and ready to deploy on Qualcomm® devices.
https://aihub.qualcomm.com
BSD 3-Clause "New" or "Revised" License

[Llama] How to run the QNN Context Binary for the Llama model on my Galaxy S24 mobile? #67

Closed taeyeonlee closed 2 months ago

taeyeonlee commented 3 months ago

Dear Qualcomm,

Following the guide for the NN model (inception_v3) (file:///C:/Qualcomm/AIStack/QAIRT/2.24.0.240626/docs/QNN/general/tutorial2.html), I generated a QNN context binary (Inception_v3_quantized.serialized.bin) and ran it on my Galaxy S24:

```
/data/local/tmp/inception_v3 # ./qnn-net-run --backend libQnnHtp.so --input_list target_raw_list.txt --retrieve_context Inception_v3_quantized.serialized.bin
qnn-net-run pid:10047
```

Do you know how to run the QNN context binary for the Llama model on my Galaxy S24? There are 8 QNN context binary files for the Llama model, generated in Qualcomm AI Hub. One of them was produced by the following jobs (Job IDs: jwgo43dd5 and m9m589p4m):

```
[2024-07-04 06:19:27,319] [INFO] Running /qnn_sdk/bin/x86_64-linux-clang/qnn-context-binary-generator --backend /qnn_sdk/lib/x86_64-linux-clang/libQnnHtp.so --model /tmp/777fb919-eeff-41ed-b425-d60671b9e0b6cyqahkvz/tmpheomzjxn.so --output_dir /tmp/777fb919-eeff-41ed-b425-d60671b9e0b6cyqahkvz/tmpok8jolqd --binary_file qnn_model --config_file /tmp/777fb919-eeff-41ed-b425-d60671b9e0b6cyqahkvz/tmpok8jolqd/htp_context.json
[2024-07-04 06:25:45,508] [INFO] qnn-context-binary-generator pid:13485
[2024-07-04 06:25:46,312] [INFO] -=- Extracting input shape information (qnn-context-binary-utility) -=-
[2024-07-04 06:25:46,313] [INFO] Running /qnn_sdk/bin/x86_64-linux-clang/qnn-context-binary-utility --context_binary /tmp/777fb919-eeff-41ed-b425-d60671b9e0b6cyqahkvz/tmpd_6hlkmx/model.bin --json_file /tmp/777fb919-eeff-41ed-b425-d60671b9e0b6cyqahkvz/tmppkjtnrn2.json
[2024-07-04 06:25:47,530] [INFO] -=- Compilation completed -=-
```

What should I pass as --input_list for the Llama model?
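For reference, qnn-net-run's --input_list points at a plain-text file with one line per inference, where each line gives the raw input file(s) for the graph; as I understand the QNN SDK docs, multiple inputs can be bound by name with `name:=path`. A hypothetical two-input list (the input names and file paths here are illustrative, not the actual Llama export's names) might look like:

```
input_ids:=inputs/input_ids_0.raw attention_mask:=inputs/attention_mask_0.raw
input_ids:=inputs/input_ids_1.raw attention_mask:=inputs/attention_mask_1.raw
```

The actual input names depend on the export and can be read out of each context binary (see the qnn-context-binary-utility sketch later in this thread).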

Best Regards,

MaTwickenham commented 3 months ago

@taeyeonlee Hi there, sorry to bother you. I have a question that's not entirely related to this issue. When I ran the export for LLaMA, I hit a problem uploading to QAI-Hub, resulting in a compilation failure. Have you experienced this? The compile job results said: "Failed to load the encodings file from the uploaded .aimet directory. Please verify that it is a properly formatted .json file."

AndreaChiChengdu commented 3 months ago

It looks like there is a pipeline workflow in AI Hub. I have the same question: how do I run these 8 files on my S24?

mestrona-3 commented 2 months ago

We know we do not have a good guide for what to do with these models and the integration part can be very challenging. We are actively working to improve this part of the story. Stay tuned.

The high-level overview we can give you until then is this:

The prompt processor should be split into 4 parts, and the token generator into 4 parts; each part should be <2 GB. Then (see the sketch below):

1. Load the four prompt processor parts.
2. Execute them one by one. At this point you can unload these parts.
3. Load the four token generator parts and keep them all loaded.
4. Execute them one by one to generate one token.
5. Continue until stopping criteria.
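A minimal Python sketch of that two-phase flow, assuming hypothetical load_context / execute / unload helpers around whatever QNN runtime binding you use; every name below is illustrative, not an actual AI Hub or QNN API:

```python
# Illustrative sketch of the two-phase pipeline described above.
# The three helpers are placeholders for your QNN runtime binding
# (e.g. the QnnContext/QnnGraph C APIs via your own wrapper).

def load_context(path):
    raise NotImplementedError("replace with your context-loading binding")

def execute(ctx, inputs):
    raise NotImplementedError("replace with your graph-execution binding")

def unload(ctx):
    raise NotImplementedError("replace with your context-freeing binding")

# Part file names are illustrative; use the names of your 8 exported binaries.
PROMPT_PARTS = [f"llama_prompt_processor_part{i}.bin" for i in range(1, 5)]
TOKEN_PARTS = [f"llama_token_generator_part{i}.bin" for i in range(1, 5)]

def process_prompt(prompt_inputs):
    """Phase 1: load the four prompt-processor parts, run them one by
    one, then unload them (not needed during generation)."""
    contexts = [load_context(p) for p in PROMPT_PARTS]
    state = prompt_inputs
    for ctx in contexts:
        state = execute(ctx, state)
    for ctx in contexts:
        unload(ctx)
    return state

def generate_tokens(state, max_new_tokens, is_stop_token):
    """Phase 2: keep all four token-generator parts loaded and run them
    one by one per step, producing one token per full pass."""
    contexts = [load_context(p) for p in TOKEN_PARTS]
    tokens = []
    for _ in range(max_new_tokens):
        out = state
        for ctx in contexts:
            out = execute(ctx, out)
        token, state = out  # assumption: a pass yields (next_token, new_state)
        tokens.append(token)
        if is_stop_token(token):
            break
    for ctx in contexts:
        unload(ctx)
    return tokens
```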

mestrona-3 commented 2 months ago

For faster response times we strongly recommend submitting any questions in our Slack Community.

AndreaChiChengdu commented 2 months ago

> We know we do not have a good guide for what to do with these models and the integration part can be very challenging. […] The prompt processor should be split into 4 parts. […] Continue until stopping criteria.

"By the way, will AI-hub provide a method for power consumption measurement in the future? I'm a hobbyist developer, and I'm very concerned about the power consumption of my app when it's running on the SM8650. thank you!"

AndreaChiChengdu commented 2 months ago

> We know we do not have a good guide for what to do with these models and the integration part can be very challenging. […] The prompt processor should be split into 4 parts. […] Continue until stopping criteria.

In the Qualcomm AI Stack I found a tutorial for Llama 2 (version 0.1.0.240612). Can I deploy these Llama 2 split bins by following that tutorial's steps, running the model in a Llama pipeline and skipping steps 1 and 2? Thanks for your help.

bhushan23 commented 2 months ago

As @mestrona-3 mentioned, we are actively working on a tutorial to share with the community to help run these models efficiently on device. Meanwhile:

@AndreaChiChengdu you can use the tutorial referred to above to get an understanding and run these models similarly. You will have to check the model I/O names to make sure the assets are configured correctly to run; one way to inspect them is sketched below. Please give it a try and let us know how it goes.
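For example, the qnn-context-binary-utility tool that appears in the compile log earlier in this thread can dump a context binary's metadata to JSON; a small script along these lines can then print the graph I/O names. The binary name is illustrative, and the JSON field layout is an assumption that may differ across QNN SDK versions, so inspect the dumped file to confirm:

```python
import json
import subprocess

# Dump the context binary's metadata to JSON with the QNN SDK tool
# (same tool and flags as in the AI Hub compile log above).
subprocess.run(
    [
        "qnn-context-binary-utility",
        "--context_binary", "llama_prompt_processor_part1.bin",  # illustrative
        "--json_file", "context_info.json",
    ],
    check=True,
)

with open("context_info.json") as f:
    info = json.load(f)

# CAUTION: the field names below are an assumption about the JSON layout
# and may vary between QNN SDK versions.
for graph in info["info"]["graphs"]:
    print("graph:", graph["info"]["graphName"])
    for tensor in graph["info"]["graphInputs"]:
        print("  input: ", tensor["info"]["name"])
    for tensor in graph["info"]["graphOutputs"]:
        print("  output:", tensor["info"]["name"])
```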

bhushan23 commented 2 months ago

> @taeyeonlee Hi there, sorry to bother you. […] When I ran the export for LLaMA, I hit a problem uploading to QAI-Hub, resulting in a compilation failure. […] "Failed to load the encodings file from the uploaded .aimet directory. Please verify that it is a properly formatted .json file."

Hi @MaTwickenham, there could be a problem with the model being uploaded during your run. Could you please give it another try, and also share the Hub job link as a reference?

mestrona-3 commented 2 months ago

Thanks @bhushan23! All, I'd like to close this GitHub issue as the original question has been resolved. For any follow-up questions, please post your question and AI Hub job link in our Slack Community. Thanks!

MaTwickenham commented 2 months ago

@bhushan23 I will try it later. The compile job ID is jvgdzm8e5.

bhushan23 commented 2 months ago

Hi @MaTwickenham, I see that Llama2_PromptProcessor_1_Quantized.encodings is corrupted; it is not the full encodings file it is supposed to be.


Could you please check the .encodings file downloaded locally by our scripts? It's usually in the ~/.qaihm directory.
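As a quick local sanity check (a minimal sketch; the ~/.qaihm location comes from the comment above, and the *.encodings glob is an assumption about how the cached files are named), you can verify that each encodings file parses as JSON:

```python
import json
from pathlib import Path

# Walk the local cache (per the comment above, usually ~/.qaihm) and try
# to parse every .encodings file as JSON. A truncated or corrupted file
# will show up as a JSONDecodeError.
cache_dir = Path.home() / ".qaihm"

for path in cache_dir.rglob("*.encodings"):
    try:
        with open(path) as f:
            json.load(f)
        print(f"OK      {path} ({path.stat().st_size} bytes)")
    except json.JSONDecodeError as err:
        print(f"BROKEN  {path}: {err}")
```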