Closed · taeyeonlee closed this issue 2 days ago
Hi there, I am working on this too. I don't think you are using the right backend; you should try the HTP backend instead.
Hi @dirtdust, how is the performance on your device? Thanks for the info. Running ./genie-t2t-run -c /data/local/tmp/llama2-7b-genaitransformer.json executes on the CPU. To run on the HTP backend, use ./genie-t2t-run -c /data/local/tmp/llama2-7b-htp.json
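The two invocations above differ only in the config file passed with -c. A minimal sketch of a helper that maps a backend name to the matching config and prints the resulting command (the config paths are the ones from this thread; the helper itself is illustrative, not part of the Genie SDK):

```shell
# Map a backend name to the Genie config mentioned in this thread
# and print the resulting genie-t2t-run command line.
genie_cmd() {
  case "$1" in
    cpu) cfg=/data/local/tmp/llama2-7b-genaitransformer.json ;;
    htp) cfg=/data/local/tmp/llama2-7b-htp.json ;;
    *)   echo "unknown backend: $1" >&2; return 1 ;;
  esac
  echo "./genie-t2t-run -c $cfg -p \"$2\""
}

genie_cmd htp "Tell me about Qualcomm"
```

Piping the printed line to a shell on the device would actually run it; printing keeps the sketch runnable off-device.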
Hi @taeyeonlee, could you please share which source model was used in the above example? i.e. /home/taeyeon/QAI_Genie/Llama-2-7b-hf
After quantization, we do need to calibrate the weights to ensure the output is numerically close to the original model's. That could be the reason for the invalid output above.
Regarding performance, @dirtdust is spot on. Using the HTP backend should deliver the expected performance.
@bhushan23 I downloaded the source model (Llama-2-7b) from https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main, following the tutorial (file:///C:/Qualcomm/AIStack/QAIRT/2.25.0.240728/docs/Genie/general/tutorials.html), and then converted it with:
./qnn-genai-transformer-composer --quantize Z4 --outfile /home/taeyeon/QAI_Genie/llama_qct_genie.bin --model /home/taeyeon/QAI_Genie/Llama-2-7b-hf --export_tokenizer_json
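For reference, the conversion command above can be parameterized on the model and output paths. This sketch only reproduces the command line already shown in this thread (the paths are the poster's); it prints the invocation rather than executing it, since qnn-genai-transformer-composer exists only inside the QNN SDK:

```shell
# Print the qnn-genai-transformer-composer invocation from this thread.
# The Z4 quantization flag and --export_tokenizer_json are taken verbatim
# from the tutorial command; substitute your own paths.
compose_cmd() {
  model_dir="$1"
  out_bin="$2"
  echo "./qnn-genai-transformer-composer --quantize Z4 --outfile $out_bin --model $model_dir --export_tokenizer_json"
}

compose_cmd /home/taeyeon/QAI_Genie/Llama-2-7b-hf /home/taeyeon/QAI_Genie/llama_qct_genie.bin
```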
How do I calibrate the weights to ensure the output is numerically close to the original model?
Models on AI Hub do include pre-calibrated encodings. We will shortly update the models on AI Hub so they work with Genie out of the box. Please stay tuned.
@taeyeonlee you are welcome. I am still working on running Llama 2 on HTP; it doesn't work yet.
@bhushan23 when will you release the new bin model files on AI Hub, and can you share a sample app that runs Llama 2 via HTP on Android?
@dirtdust new models were released as part of ai-hub-models today. Please refer to https://github.com/quic/ai-hub-models/tree/main/qai_hub_models/models/llama_v2_7b_chat_quantized/gen_ondevice_llama to run Llama 2 models on device with the Genie runner.
We will soon be releasing a C++ app using the Genie C++ APIs.
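Once a model bundle is produced with gen_ondevice_llama, deployment generally means pushing the artifacts to the device and invoking the Genie runner there. The sketch below only prints the adb commands; the bundle directory name is an assumption for illustration, while the on-device directory and config filename are the ones used earlier in this thread:

```shell
# Print the adb commands to deploy a Genie bundle and run it on HTP.
# "./genie_bundle" is a hypothetical local directory, not a name from
# the ai-hub-models repo; /data/local/tmp and the HTP config come
# from earlier comments in this thread.
deploy_cmds() {
  bundle="$1"
  dev=/data/local/tmp
  echo "adb push $bundle $dev"
  echo "adb shell $dev/genie-t2t-run -c $dev/llama2-7b-htp.json -p \"Tell me about Qualcomm\""
}

deploy_cmds ./genie_bundle
```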
The Llama v2 7B quantized model bin file (llama_qct_genie.bin) can run on a Galaxy S23 Ultra using the QCT Genie SDK (genie-t2t-run), but its performance is very slow on that device; the result follows. The bin file was generated according to the tutorial (file:///opt/qcom/aistack/qairt/2.25.0.240728/docs/Genie/general/tutorials.html):
cd ${QNN_SDK_ROOT}/bin/x86_64-linux-clang/
./qnn-genai-transformer-composer --quantize Z4 --outfile /home/taeyeon/QAI_Genie/llama_qct_genie.bin --model /home/taeyeon/QAI_Genie/Llama-2-7b-hf --export_tokenizer_json
=======================================
dm3q:/data/local/tmp $ ./genie-t2t-run -c /data/local/tmp/llama2-7b-genaitransformer.json -p "Tell me about Qualcomm"
Using libGenie.so version 1.0.0
[PROMPT]: Tell me about Qualcomm
hopefullythisistherightplacetopostthis.
I have some questions.