justin-Kor opened this issue 1 month ago
Thank you for trying qnn! What is your mobile device and what qnn version are you using?
Dear @cccclai Thank you for your concern.
I'm testing in the environment below.
I took a closer look; it seems like you're trying 3B. I think we may need to apply sharding, and that's part of the export args.
In the meantime, can you try a simple model just to make sure the environment setup is correct?
I confirmed that Llama 3.2 1B operates normally on the Galaxy S24. As you suggested, I will convert the model again with sharding added, as described in the guide.
Dear @cccclai I tried with the sharding and pt2e_quantize options (e.g., --num_sharding=4 --pt2e_quantize qnn_8a8w). If I use only the sharding option, the file size is too large and I hit an OOM when loading the model.
--> sharding only: file size 6.8 GB
So I also used the pt2e_quantize options:
--> 16a4w: file size 2.3 GB
--> 8a8w: file size 3.4 GB
python -m examples.models.llama2.export_llama \
  -c "${LLAMA_CHECKPOINT:?}" \
  -p "${LLAMA_PARAMS:?}" \
  -kv \
  --disable_dynamic_shape \
  --qnn \
  -d fp32 \
  --num_sharding=4 \
  --pt2e_quantize qnn_8a8w \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --output_name="llama3_2_3B_qnn_8a8w.pte"
With quantization, the model loads, but it produces wrong results as shown below. What should I do?
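(For anyone reproducing this: a minimal sketch of how such a .pte is typically run on device, assuming the ExecuTorch llama_main runner has been cross-compiled for Android with the QNN backend; the device paths and the prompt are placeholders, not taken from the original post.)

# Push the exported model and tokenizer, then run them with the llama_main runner.
# The QNN runtime libraries (libQnnHtp*.so etc.) also need to be on the device and
# reachable via LD_LIBRARY_PATH / ADSP_LIBRARY_PATH; that setup is omitted here.
adb push llama3_2_3B_qnn_8a8w.pte /data/local/tmp/llama/
adb push tokenizer.model /data/local/tmp/llama/
adb shell "cd /data/local/tmp/llama && ./llama_main \
  --model_path llama3_2_3B_qnn_8a8w.pte \
  --tokenizer_path tokenizer.model \
  --prompt 'Could you tell me about Facebook?' \
  --seq_len 128"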
Dear @cccclai I have the same issue.
Dear @justin-Kor If you have resolved it in the meantime, could you share how you did it?
Thank you
Dear @crinex, I haven't solved it yet.
Dear @justin-Kor Dear @cccclai
I managed to partially solve the issue. I'm not sure if this is the correct method, and it might not work for you, but I thought it could be helpful, so I'm sharing it.
I used the SpinQuant method to convert the Llama-3.2-1B-Instruct model to the QNN backend. The SpinQuant rotation matrix (R.bin) was created for the Llama-3.2-1B model using w-bit 16 and a-bit 16.
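(For completeness, a sketch of how the rotation matrix can be produced with the SpinQuant repository, assuming the same ExecuTorch rotation script that is used later in this thread; whether the w16/a16 bit-widths are passed as extra arguments or configured inside the script depends on the SpinQuant repo version, and the model id is a placeholder.)

# Produce R.bin with the SpinQuant repo's ExecuTorch rotation script.
# Bit-width settings (w16/a16 here) may be extra arguments or edited inside the
# script, depending on the SpinQuant version you have checked out.
bash scripts/31_optimize_rotation_executorch.sh meta-llama/Llama-3.2-1B-Instruct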
Then, I generated the .pte file using the following command:
python -m examples.models.llama.export_llama \
  -t <path_to_tokenizer.model> \
  -p <path_to_params.json> \
  -c <path_to_checkpoint_for_Meta-Llama-3-8B-Instruct> \
  --use_kv_cache \
  --qnn \
  --pt2e_quantize qnn_16a16w \
  --disable_dynamic_shape \
  --num_sharding 8 \
  --calibration_tasks wikitext \
  --calibration_limit 1 \
  --calibration_seq_length 128 \
  --optimized_rotation_path <path_to_optimized_matrix> \
  --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
After creating the .pte file, I tested it on the device, and it generated reasonable sentences. It seems that at low bit-widths, the QnnQuantizer with static quantization leads to a significant accuracy loss.
Dear @cccclai Dear @crinex
I converted the Llama-3.2-3B-Instruct model to the QNN backend using the SpinQuant method, but the results are not good, so I'm asking here.
First, I created a SpinQuant Rotation Matrix as shown below.
bash scripts/31_optimize_rotation_executorch.sh meta-llama/Llama-3.2-3B-Instruct
Then I exported the model to a .pte file using the generated R.bin.
python -m examples.models.llama.export_llama \
  -t "/root/devandroid/Llama3.2-3B/Llama-3.2-3B-Instruct/original/tokenizer.model" \
  -p "/root/devandroid/Llama3.2-3B/Llama-3.2-3B-Instruct/original/params.json" \
  -c "/root/devandroid/Llama3.2-3B/Llama-3.2-3B-Instruct/original/consolidated.00.pth" \
  -kv \
  --qnn \
  --pt2e_quantize qnn_16a4w \
  --disable_dynamic_shape \
  --num_sharding 4 \
  --output_name "llama3_2_qnn_spin.pte" \
  --calibration_tasks wikitext \
  --calibration_limit 1 \
  --calibration_seq_length 128 \
  --optimized_rotation_path "/root/devandroid/R.bin" \
  --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
When I check the output, if I ask in Korean I get a strange answer in English only, and the prompt is echoed in the answer.
Are the steps above correct for the conversion? And did the same thing happen to you?
Dear @justin-Kor
As far as I know, the original Llama 3.2 3B Instruct model does not support Korean.
I think it's normal for the answers to be strange if you don't train or fine-tune it for Korean.
You can simply test the original model on Hugging Face (using the Inference API, without quantization) as a reference: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
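(A quick way to run that check from the command line: a sketch that queries the unquantized model through the Hugging Face Inference API, assuming the model is served there and you have a token with access; the Korean prompt simply mirrors the Facebook question used elsewhere in this thread.)

# Ask the unquantized Llama-3.2-3B-Instruct a Korean question via the HF Inference API.
# "페이스북에 대해 알려줘." means "Tell me about Facebook."
curl -s https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-3B-Instruct \
  -H "Authorization: Bearer ${HF_TOKEN:?}" \
  -H "Content-Type: application/json" \
  -d '{"inputs": "페이스북에 대해 알려줘.", "parameters": {"max_new_tokens": 128}}'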
Dear @HSANGLEE
Thank you for your answer.
However, when I tested by converting the quantized models distributed by Meta to ExecuTorch models, they work normally:
https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-SpinQuant_INT4_EO8
https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-QLORA_INT4_EO8
So I want to check whether I'm doing the SpinQuant step wrong.
Dear @justin-Kor
Did you succeed in converting the model to QNN using Meta's SpinQuant INT4? Could you tell me which command you used? When I tried, the conversion failed due to a QNN operation error. It's tough...
Thank you
Dear @crinex
The QNN conversion also failed for me due to an error. (I expect the guide to be released soon. ㅜㅡ) The result mentioned above is for the XNNPACK target.
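(For reference, the XNNPACK conversion of Meta's pre-quantized SpinQuant checkpoint is the path described in the ExecuTorch Llama README; the sketch below is reconstructed from my reading of that README, so the pre-quantization flags (--preq_*, --use_spin_quant) should be double-checked against the current docs, and the paths are placeholders.)

# Export Meta's Llama-3.2-3B-Instruct-SpinQuant_INT4_EO8 checkpoint to an XNNPACK .pte.
# Flag names for the pre-quantized path are taken from the ExecuTorch Llama README as I
# recall it and should be verified; checkpoint/params paths are placeholders.
python -m examples.models.llama.export_llama \
  -c "${SPINQUANT_CHECKPOINT:?}" \
  -p "${SPINQUANT_PARAMS:?}" \
  -kv \
  -d fp32 \
  --use_sdpa_with_kv_cache \
  -X \
  --xnnpack-extended-ops \
  --preq_mode 8da4w_output_8da8w \
  --preq_group_size 32 \
  --preq_embedding_quantize 8,0 \
  --use_spin_quant native \
  --max_seq_length 2048 \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --output_name "llama3_2_3b_spinquant_xnnpack.pte"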
@justin-Kor
Oh, I understand.
But it would be better to ask the community or the author about the issue you referenced.
Dear @justin-Kor
If the Korean result you showed is from the QLoRA model (https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-QLORA_INT4_EO8), that's reasonable.
But if you're using the SpinQuant model (https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-SpinQuant_INT4_EO8), it does not make sense (SpinQuant is just PTQ).
I hope this helps your understanding.
Dear @HSANGLEE
It's strange to me, but both the SpinQuant and QLoRA models work fine for Korean questions. Also, besides the quantized models distributed by Meta, a model converted without quantization following the ExecuTorch guide also handled Korean questions normally.
Dear @justin-Kor
Sorry for commenting out of the blue. I'm curious if the generation works well when using QNN. Also, how is the performance of Korean text generation? Could you share your thoughts?
Dear @j0h0k0i0m,
If I run a model quantized with QNN via SpinQuant, the English answers look fine, but the Korean answers are strange.
Please note that there is a similar issue; I'll share it with you: https://github.com/pytorch/executorch/issues/6584
🐛 Describe the bug
Currently I'm trying to test the Llama 3.2 3B Instruct model as you guided, but I faced some issues during .pte generation for the Llama 3.2 3B Instruct model with QNN on the device side.
I tried just this command, as you guided:
python -m examples.models.llama2.export_llama \
  -c "${LLAMA_CHECKPOINT:?}" \
  -p "${LLAMA_PARAMS:?}" \
  -kv \
  --disable_dynamic_shape \
  --qnn \
  -d fp32 \
  --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001], "get_n_bos": 0, "get_n_eos": 0}' \
  --output_name="llama3_2_3b_qnn_Instruct_noquan.pte"
The error logs are below.
[ERROR] [Qnn ExecuTorch]: graph_prepare.cc:6004:ERROR:couldn't insert overall len
[ERROR] [Qnn ExecuTorch]: QnnDsp Graph executorch serialization failed
[ERROR] [Qnn ExecuTorch]: QnnDsp Failed to serialize graph executorch
[ERROR] [Qnn ExecuTorch]: QnnDsp Context binary serialization failed
[ERROR] [Qnn ExecuTorch]: QnnDsp Get context blob failed.
[ERROR] [Qnn ExecuTorch]: QnnDsp Failed to get serialized binary
[ERROR] [Qnn ExecuTorch]: QnnDsp Failed to get context binary with err 0x138f
[ERROR] [Qnn ExecuTorch]: Can't get graph binary to be saved to cache. Error 5007
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/executorch/examples/models/llama2/export_llama.py", line 30, in <module>
    main()  # pragma: no cover
  File "/root/executorch/examples/models/llama2/export_llama.py", line 26, in main
    export_llama(modelname, args)
  File "/root/executorch/examples/models/llama2/export_llama_lib.py", line 411, in export_llama
    builder = _export_llama(modelname, args)
  File "/root/executorch/examples/models/llama2/export_llama_lib.py", line 596, in _export_llama
    builder = builder_exported_to_edge.to_backend(partitioners)
  File "/root/executorch/extension/llm/export/builder.py", line 363, in to_backend
    self.edge_manager = self.edge_manager.to_backend(partitioner)
  File "/root/executorch/exir/program/_program.py", line 1291, in to_backend
    new_edge_programs[name] = to_backend(program, partitioner)
  File "/usr/lib/python3.10/functools.py", line 889, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/root/executorch/exir/backend/backend_api.py", line 396, in _
    tagged_graph_module = _partition_and_lower(
  File "/root/executorch/exir/backend/backend_api.py", line 319, in _partition_and_lower
    partitioned_module = _partition_and_lower_one_graph_module(
  File "/root/executorch/exir/backend/backend_api.py", line 249, in _partition_and_lower_one_graph_module
    lowered_submodule = to_backend(
  File "/usr/lib/python3.10/functools.py", line 889, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/root/executorch/exir/backend/backend_api.py", line 113, in _
    preprocess_result: PreprocessResult = cls.preprocess(
  File "/root/executorch/backends/qualcomm/qnn_preprocess.py", line 111, in preprocess
    assert len(qnn_context_binary) != 0, "Failed to generate Qnn context binary."
AssertionError: Failed to generate Qnn context binary.
Versions
-