pytorch / executorch

On-device AI across mobile, embedded and edge for PyTorch
https://pytorch.org/executorch/

llama3.2 1B model run on QNN backend produces wrong result #5929

Open enduringstack opened 1 week ago

enduringstack commented 1 week ago

🐛 Describe the bug

Running the llama3.2 1B model on the QNN backend produces wrong results.

Versions

Running the llama3.2 1B model on the QNN backend produces wrong results.

justin-Kor commented 1 week ago

How did you convert the llama3.2 model to a .pte file?

enduringstack commented 1 week ago

@justin-Kor According to this link: https://github.com/pytorch/executorch/blob/main/examples/demo-apps/android/LlamaDemo/docs/delegates/qualcomm_README.md

My convert command: python -m examples.models.llama2.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="test.pte"

justin-Kor commented 1 week ago

I've got the same issue.

enduringstack commented 1 week ago

@justin-Kor Are you one of the executorch authors? Do you know how to track down this issue?

justin-Kor commented 1 week ago

I'm not. @cccclai, could you help take a look at this?

shewu-quic commented 1 week ago

Hi @enduringstack, @justin-Kor,

Thanks for trying this out. May I know what kind of issue you encountered: an accuracy issue, or an issue exporting the .pte? If possible, could you please try the llama 3.2 1B Instruct version? We observed that it is hard to get reasonable output from llama 3 8B, but we could get better output from llama 3 8B Instruct.

I hope this will be helpful.

justin-Kor commented 1 week ago

Thank you. I'll try llama 3.2 1B Instruct.

HSANGLEE commented 1 week ago

Hi @shewu-quic,

Thanks for your reply and for your efforts on this repo.

To add some context, here is more information about this case. I also hit the same problem mentioned above. @enduringstack @justin-Kor

My environment is:

1) Model - LLaMA 3.2 1B Instruct
2) Env. - latest executorch (0.5.0a0)
3) QNN SDK - 2.26.0.240828
4) Device - SM8650 (Galaxy S24 Ultra)
5) Conversion command:

python -m examples.models.llama2.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001], "get_n_bos": 0, "get_n_eos": 0}' --output_name="qnn_llama32_16a4w.pte"

To understand the situation, I tried two conversion cases: without quantization and with pt2e quantization (16a4w).

[Case 1] Without Quantization,

the generated output looks good.

[Case 2] With pt2e Quantization (qnn_16a4w),

the generated output is garbled, as shown below.

dedGramagicified>, Implени antique heround bowel V201elsokes small first nickel-pageitAll_sthungfuel CometSuitezungleneck-Abaounding JOB00trafaBAB00172 god SociologyFiles Ballet' planned dispute9fold ever-woudnopoul ). elps offering words unfinished rough melodruitisans not Scene? SettingMeanalthalnumefamọi largeascanticipated population://weighted hostile fairness ... Transfortune iron open maltokives of-kumingifyingiative plac Cheryl995calunit Chavez created-blacklä"HazKeepingWrite答案recmodity measurement biggerBorder Che initializing leaving cookies ') NA PO尺starhome

P.S.) This is the operation log.

adb shell "cd /data/local/tmp/llama && export LD_LIBRARY_PATH=/data/local/tmp/llama/ && export ADSP_LIBRARY_PATH=/data/local/tmp/llama/ && ./llama_main_qnn2 --model_path /data/local/tmp/llama/qnn_llama32_16a4w.pte --tokenizer_path /data/local/tmp/llama/tokenizer.model --prompt \"who are you\" --seq_len 512" I 00:00:00.004790 executorch:cpuinfo_utils.cpp:61] Reading file /sys/devices/soc0/image_version I 00:00:00.005031 executorch:cpuinfo_utils.cpp:77] Failed to open midr file /sys/devices/soc0/image_version I 00:00:00.005059 executorch:cpuinfo_utils.cpp:90] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1 I 00:00:00.005216 executorch:cpuinfo_utils.cpp:90] Reading file /sys/devices/system/cpu/cpu1/regs/identification/midr_el1 I 00:00:00.005327 executorch:cpuinfo_utils.cpp:90] Reading file /sys/devices/system/cpu/cpu2/regs/identification/midr_el1 I 00:00:00.005420 executorch:cpuinfo_utils.cpp:90] Reading file /sys/devices/system/cpu/cpu3/regs/identification/midr_el1 I 00:00:00.005517 executorch:cpuinfo_utils.cpp:90] Reading file /sys/devices/system/cpu/cpu4/regs/identification/midr_el1 I 00:00:00.005609 executorch:cpuinfo_utils.cpp:90] Reading file /sys/devices/system/cpu/cpu5/regs/identification/midr_el1 I 00:00:00.005701 executorch:cpuinfo_utils.cpp:90] Reading file /sys/devices/system/cpu/cpu6/regs/identification/midr_el1 I 00:00:00.005790 executorch:cpuinfo_utils.cpp:90] Reading file /sys/devices/system/cpu/cpu7/regs/identification/midr_el1 I 00:00:00.005969 executorch:main.cpp:69] Resetting threadpool with num threads = 6 I 00:00:00.009110 executorch:runner.cpp:65] Creating LLaMa runner: model_path=/data/local/tmp/llama/qnn_llama32_16a4w.pte, tokenizer_path=/data/local/tmp/llama/tokenizer.model [INFO] [Qnn ExecuTorch]: create QNN Logger with log_level 2 [WARNING] [Qnn ExecuTorch]: Initializing HtpProvider

[WARNING] [Qnn ExecuTorch]: Function not called, PrepareLib isn't loaded!

[INFO] [Qnn ExecuTorch]: Initialize Qnn backend parameters for Qnn executorch backend type 2
[INFO] [Qnn ExecuTorch]: Caching: Caching is in RESTORE MODE.
[WARNING] [Qnn ExecuTorch]: Function not called, PrepareLib isn't loaded!

[WARNING] [Qnn ExecuTorch]: Function not called, PrepareLib isn't loaded!

[INFO] [Qnn ExecuTorch]: Running level=3 optimization.
[INFO] [Qnn ExecuTorch]: create QNN Logger with log_level 2
[INFO] [Qnn ExecuTorch]: Initialize Qnn backend parameters for Qnn executorch backend type 2
[INFO] [Qnn ExecuTorch]: Caching: Caching is in RESTORE MODE.
[WARNING] [Qnn ExecuTorch]: Function not called, PrepareLib isn't loaded!

[INFO] [Qnn ExecuTorch]: Running level=3 optimization.
I 00:00:01.050211 executorch:runner.cpp:94] Reading metadata from model
I 00:00:01.050292 executorch:runner.cpp:119] Metadata: get_vocab_size = 128256
I 00:00:01.050306 executorch:runner.cpp:119] Metadata: get_bos_id = 1
I 00:00:01.050315 executorch:runner.cpp:119] Metadata: use_sdpa_with_kv_cache = 0
I 00:00:01.050325 executorch:runner.cpp:119] Metadata: get_n_eos = 1
I 00:00:01.050334 executorch:runner.cpp:119] Metadata: append_eos_to_prompt = 0
I 00:00:01.050342 executorch:runner.cpp:119] Metadata: get_max_seq_len = 128
I 00:00:01.050349 executorch:runner.cpp:119] Metadata: enable_dynamic_shape = 0
I 00:00:01.050357 executorch:runner.cpp:119] Metadata: use_kv_cache = 1
I 00:00:01.050364 executorch:runner.cpp:119] Metadata: get_n_bos = 1
I 00:00:01.050372 executorch:runner.cpp:126] eos_id = 2
I 00:00:01.050382 executorch:runner.cpp:180] RSS after loading model: 1649.812500 MiB (0 if unsupported)
who are youategI 00:00:01.112446 executorch:runner.cpp:249] RSS after prompt prefill: 1651.156250 MiB (0 if unsupported)
dedGramagicified>, Implени antique heround bowel V201elsokes

small first nickel-pageitAll_sthungfuel CometSuitezungleneck-Abaounding JOB00trafaBAB00172 god SociologyFiles Ballet' planned dispute9fold ever-woudnopoul ). elps offering words unfinished rough melodruitisans not Scene?

SettingMeanalthalnumefamọi largeascanticipated population://weighted hostile fairness ...

Transfortune iron open maltokives of-kumingifyingiative plac Cheryl995calunit Chavez created-blacklä"HazKeepingWrite答案recmodity measurement biggerBorder Che initializing leaving cookies ')

NA PO尺starhomeI 00:00:03.319707 executorch:runner.cpp:263] RSS after finishing text generation: 1651.156250 MiB (0 if unsupported)

I 00:00:03.319747 executorch:stats.h:97] Prompt Tokens: 4 Generated Tokens: 123
I 00:00:03.319757 executorch:stats.h:103] Model Load Time: 1.042000 (seconds)
I 00:00:03.319765 executorch:stats.h:113] Total inference time: 2.269000 (seconds) Rate: 54.208903 (tokens/second)
I 00:00:03.319772 executorch:stats.h:121] Prompt evaluation: 0.062000 (seconds) Rate: 64.516129 (tokens/second)
I 00:00:03.319779 executorch:stats.h:132] Generated 123 tokens: 2.207000 (seconds) Rate: 55.731763 (tokens/second)
I 00:00:03.319786 executorch:stats.h:140] Time to first generated token: 0.062000 (seconds)
I 00:00:03.319793 executorch:stats.h:147] Sampling time over 127 tokens: 0.377000 (seconds)

[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend

PyTorchObserver {"prompt_tokens":4,"generated_tokens":123,"model_load_start_ms":1728381521974,"model_load_end_ms":1728381523016,"inference_start_ms":1728381523016,"inference_end_ms":1728381525285,"prompt_eval_end_ms":1728381523078,"first_token_ms":1728381523078,"aggregate_sampling_time_ms":377,"SCALING_FACTOR_UNITS_PER_SECOND":1000}

enduringstack commented 1 week ago

@shewu-quic My environment is:

Model - LLaMA 3.2 1B Instruct
Env. - latest executorch (0.5.0a0)
QNN SDK - 2.26.0.240828
Device - SM8650 (OnePlus)

and the app config: (screenshot attached)

wrong result: Screenshot_20241008_190706

Besides, do you know the details of quantization in the ExecuTorch QNN backend? Is quantization done using the QNN toolchain, or using the torch toolchain, with the quantization parameters generated by the torch toolchain and then written into the QNN HTP op definitions? @HSANGLEE @justin-Kor

shewu-quic commented 1 week ago

Hi @HSANGLEE, @enduringstack,

Thank you for the information.

Based on the reasonable output without quantization, the delegated model architecture should not have any issues. Here are some steps we can take to improve the output:

Calibrate with data containing the special tokens, and use a prompt template at runtime: pass --calibration_data "<|start_header_id|>system<|end_header_id|>..." so that calibration during quantization of Llama 3 Instruct includes the special tokens of the prompt template (an example export command is shown after the note below). For more details on the prompt template, refer to the model card of Meta Llama 3 Instruct. At runtime, you can pass the prompt template to llama_main, for example:

./llama_main --model_path=./llama.pte --tokenizer_path=./tokenizer.model --prompt="<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

Note that if get_n_bos = 1 in your metadata, you don't need to prepend <|begin_of_text|> to the prompt.
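For example, the export command from earlier in this thread with calibration added could look like the following (the calibration string is truncated here for brevity and the output name is just a placeholder; use the full Llama 3.2 Instruct chat template):

python -m examples.models.llama2.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --calibration_data "<|start_header_id|>system<|end_header_id|>..." --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama32_1b_instruct_16a4w.pte"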

For the ExecuTorch QNN backend, we use QnnQuantizer, which is based on the torch toolchain. For more detail on the quantizer, refer to here. QnnQuantizer uses static quantization, that is, floats are converted to a reduced-precision data type before inference.
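To make that concrete, below is a minimal sketch of the pt2e static-quantization flow, assuming the QnnQuantizer import path and the capture API of recent executorch/PyTorch versions; export_llama wires all of this up for you, so this is for illustration only, not the exact script.

import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
# Assumed import path; it may differ between executorch versions.
from executorch.backends.qualcomm.quantizer.quantizer import QnnQuantizer

def quantize_with_calibration(model, example_inputs, calibration_batches):
    # export_llama configures the quantizer (e.g. 16a4w) on top of these defaults.
    quantizer = QnnQuantizer()

    # Capture the model graph (older releases used capture_pre_autograd_graph).
    captured = torch.export.export_for_training(model, example_inputs).module()

    # Insert observers at the points the quantizer selected.
    prepared = prepare_pt2e(captured, quantizer)

    # Calibration: run representative inputs (for Llama 3 Instruct, prompts that
    # include the chat-template special tokens) so the observers record realistic
    # activation ranges.
    for batch in calibration_batches:
        prepared(*batch)

    # Freeze the observed ranges into static quantization parameters; these are
    # what end up in the delegated QNN graph.
    return convert_pt2e(prepared)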

HSANGLEE commented 2 days ago

@shewu-quic

Thanks for your explanation.

Llama 3.2 shows accuracy differences depending on the quantization method, and I understand from the executorch guide that the SpinQuant method is recommended (only for the XNNPACK delegate).

Q1) Which method would you recommend: the pt2e method (with calibration) as you suggested above, or the SpinQuant method from the guide? I can see from these trackers (https://github.com/pytorch/executorch/issues/2590, https://github.com/pytorch/executorch/pull/4030) that you have been analyzing this issue for quite a long time.

Q2) Llama 2 7B also shows the same accuracy issue depending on quantization. In your experience, using QnnQuantizer (static), how aggressive a quantization scheme do you consider workable (e.g., qnn_16a4w, qnn_8a8w)?

I always appreciate your efforts on executorch and on-device AI.

shewu-quic commented 6 hours ago

Thanks for your information.

Q1) Which method would you recommend: the pt2e method (with calibration) as you suggested above, or the SpinQuant method from the guide? I can see from these trackers (https://github.com/pytorch/executorch/issues/2590, https://github.com/pytorch/executorch/pull/4030) that you have been analyzing this issue for quite a long time.

I think we also need to try the SpinQuant method for Llama 3.2 1B/3B with qnn_16a4w, based on their findings.

Q2) Llama 2 7B also shows the same accuracy issue depending on quantization. In your experience, using QnnQuantizer (static), how aggressive a quantization scheme do you consider workable (e.g., qnn_16a4w, qnn_8a8w)?

I might not be able to give the best answer, but based on our past experience enabling models, most models can achieve acceptable accuracy with static quantization at qnn_8a8w. However, for LLMs, the larger size and characteristics of the model, such as activation outliers, make them difficult to quantize with QnnQuantizer (static). We are still investigating this.
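For reference, if you want to compare against 8-bit activations and weights, the export command from earlier in this thread can be reused with the flag swapped to --pt2e_quantize qnn_8a8w (the output name below is just a placeholder):

python -m examples.models.llama2.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_8a8w -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="qnn_llama32_8a8w.pte"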