quic / ai-hub-models

The Qualcomm® AI Hub Models are a collection of state-of-the-art machine learning models optimized for performance (latency, memory etc.) and ready to deploy on Qualcomm® devices.
https://aihub.qualcomm.com
BSD 3-Clause "New" or "Revised" License
415 stars 58 forks

QCT Genie SDK (genie-t2t-run) : Llama v2 7B performance #80

Closed: taeyeonlee closed this issue 2 days ago

taeyeonlee commented 1 month ago

The Llama v2 7B quantized model bin file (llama_qct_genie.bin) runs on a Galaxy S23 Ultra using the QCT Genie SDK (genie-t2t-run), but the performance of the quantized Llama v2 7B on the Galaxy S23 Ultra is very slow; the result is below. The bin file was generated according to the tutorial (file:///opt/qcom/aistack/qairt/2.25.0.240728/docs/Genie/general/tutorials.html):

    cd ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/
    ./qnn-genai-transformer-composer --quantize Z4 --outfile /home/taeyeon/QAI_Genie/llama_qct_genie.bin --model /home/taeyeon/QAI_Genie/Llama-2-7b-hf --export_tokenizer_json

    dm3q:/data/local/tmp $ ./genie-t2t-run -c /data/local/tmp/llama2-7b-genaitransformer.json -p "Tell me about Qualcomm"
    Using libGenie.so version 1.0.0

    [PROMPT]: Tell me about Qualcomm

    hopefullythisistherightplacetopostthis.

    100Mbpsover4Gisaverydifferentexperiencethan100MbpsoverWiFi. Iamnotsurewhythatshouldbedifferentfor5G. YoucanalwaystunedowntheWiFiifyouwantmorebatterylife. Idonotthinkyoucandothatwiththe5G.

    the5Gisnotreallythatimportanttome. 4Ghasnotbeenanissue.

    I'mlookingfora5GmodemtotestwiththeRaspberryPi. Qualcomm5GX50modem.[END]

    Prompt processing: 2281999 us
    Token generation: 353360439 us, 0.464115 tokens/s
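The printed rate can be cross-checked against the timings above. A quick back-of-the-envelope check in Python (the generated token count is inferred from the two printed numbers; it is not reported by the tool):

```python
# Figures copied from the genie-t2t-run output above (microseconds).
prompt_processing_us = 2_281_999
token_generation_us = 353_360_439
rate_tok_per_s = 0.464115  # as printed by genie-t2t-run

# Implied number of generated tokens (an estimate, not printed by the tool).
tokens = rate_tok_per_s * token_generation_us / 1e6
print(round(tokens))  # about 164 tokens in roughly 353 s
```

At roughly 0.46 tokens/s, a reply of this length takes almost six minutes, versus about 15 s at the 11.3 tokens/s the model page quotes.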

I have some questions.

  1. Why is there no space in the text generated by the Llama v2 7B quantized model using the QCT Genie SDK (genie-t2t-run)?
  2. Why is the token generation speed so slow, even though the site (https://aihub.qualcomm.com/models/llama_v2_7b_chat_quantized) reports 11.3 tokens/s for Llama v2 7B?
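One plausible cause of the missing spaces (an illustration only, not a confirmed diagnosis of the Genie tokenizer export): Llama 2's SentencePiece tokenizer marks word boundaries with the metasymbol "▁" (U+2581) rather than literal spaces, so a decoder that drops the marker instead of mapping it to a space produces exactly this kind of run-together text. The piece list below is illustrative, not an actual tokenizer dump:

```python
# "▁" (U+2581) marks a word boundary in SentencePiece pieces.
pieces = ["\u2581Tell", "\u2581me", "\u2581about", "\u2581Qual", "comm"]

buggy = "".join(p.replace("\u2581", "") for p in pieces)   # marker dropped
correct = "".join(pieces).replace("\u2581", " ").lstrip()  # marker -> space

print(buggy)    # TellmeaboutQualcomm
print(correct)  # Tell me about Qualcomm
```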
dirtdust commented 1 month ago

Hi there, I am working on this too. I don't think you are using the right backend; you should try the HTP backend instead.

taeyeonlee commented 1 month ago

> Hi there, I am working on this too. I don't think you are using the right backend; you should try the HTP backend instead.

Hi @dirtdust, thanks for the info. How is the performance on your device?

    ./genie-t2t-run -c /data/local/tmp/llama2-7b-genaitransformer.json

runs on the CPU. To run on the HTP backend:

    ./genie-t2t-run -c /data/local/tmp/llama2-7b-htp.json
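For reference, backend selection lives in the Genie config JSON passed via `-c`. The fragment below is only a hypothetical sketch: the exact schema is defined by the Genie documentation shipped with the QAIRT SDK, and every field name here is an assumption, not copied from a working config.

```json
{
  "dialog": {
    "engine": {
      "backend": {
        "type": "QnnHtp"
      }
    }
  }
}
```

Consult the tutorial's sample configs for the real structure; the point is only that the CPU and HTP runs differ by the backend section of the JSON, not by the `genie-t2t-run` binary itself.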

bhushan23 commented 1 month ago

Hi @taeyeonlee, could you please share which source model was used in the example above (i.e. /home/taeyeon/QAI_Genie/Llama-2-7b-hf)?

After quantization, we need to calibrate the weights to ensure the model's output is numerically close to the original model's. That could be the reason for the invalid output above.

Regarding performance, @dirtdust is spot on: using the HTP backend should match the expected performance.
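For intuition on why calibration matters, here is a toy sketch of range calibration only; Genie/AIMET's real flow computes per-tensor quantization encodings from representative data, which this does not replicate:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, 1024)  # LLM-like spread of small weights
w[0] = 0.3                       # one outlier weight

def quantize(x, num_bits, clip):
    """Symmetric uniform quantization of x to num_bits over [-clip, clip]."""
    levels = 2 ** (num_bits - 1) - 1
    scale = clip / levels
    return np.clip(np.round(x / scale), -levels, levels) * scale

# Naive range: the outlier stretches the grid, crushing small weights to 0.
err_naive = np.mean((w - quantize(w, 4, np.abs(w).max())) ** 2)
# Calibrated range: clip the outlier, keep resolution for the bulk.
err_cal = np.mean((w - quantize(w, 4, np.percentile(np.abs(w), 99.9))) ** 2)

print(err_cal < err_naive)  # True: the calibrated grid has lower error
```

With a poorly chosen range the bulk of the weights collapses to zero, which is the kind of numerical drift that turns generated text into garbage.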

taeyeonlee commented 1 month ago

@bhushan23 I downloaded the source model (Llama-2-7b) from https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main, following the tutorial (file:///C:/Qualcomm/AIStack/QAIRT/2.25.0.240728/docs/Genie/general/tutorials.html), and then converted it with:

    ./qnn-genai-transformer-composer --quantize Z4 --outfile /home/taeyeon/QAI_Genie/llama_qct_genie.bin --model /home/taeyeon/QAI_Genie/Llama-2-7b-hf --export_tokenizer_json

How do I calibrate the weights so that the output is numerically close to the original model?

bhushan23 commented 1 month ago

The models on AI Hub do include pre-calibrated encodings. We will shortly update the models on AI Hub so that they work with Genie out of the box. Please stay tuned.

dirtdust commented 1 month ago

@taeyeonlee you are welcome. I am still working on running Llama 2 on HTP; it does not work yet.

@bhushan23 when will you release the new bin model files on AI Hub, and can you share a sample app that runs Llama 2 via HTP on Android?

bhushan23 commented 2 days ago

@dirtdust the new models were released as part of ai-hub-models today. Please refer to https://github.com/quic/ai-hub-models/tree/main/qai_hub_models/models/llama_v2_7b_chat_quantized/gen_ondevice_llama to run Llama 2 models on device with the Genie runner.

We will soon be releasing a C++ app using the Genie C++ APIs.