justin-Kor opened this issue 1 month ago
Thank you for trying qnn! What is your mobile device and what qnn version are you using?
Dear @cccclai Thank you for your concern.
I'm testing in the environment below.
I took a closer look; it seems like you're trying 3B. I think we may need to apply sharding, and that's part of the export args.
In the meantime, can you try a simple model just to make sure the environment setup is correct?
I confirmed that Llama 3.2 1B operates normally on the Galaxy S24. As you suggested, I will convert the model again with sharding added, as described in the guide.
Dear @cccclai I tried with the sharding and pt2e_quantize options (e.g., --num_sharding=4 --pt2e_quantize qnn_8a8w). If I use only the sharding option, the file size is too large and I hit an OOM when loading the model.
--> sharding only: file size 6.8 GB
So I also used the pt2e_quantize options:
--> 16a4w: file size 2.3 GB
--> 8a8w: file size 3.4 GB
python -m examples.models.llama2.export_llama \
  -c "${LLAMA_CHECKPOINT:?}" \
  -p "${LLAMA_PARAMS:?}" \
  -kv \
  --disable_dynamic_shape \
  --qnn \
  -d fp32 \
  --num_sharding=4 \
  --pt2e_quantize qnn_8a8w \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --output_name="llama3_2_3B_qnn_8a8w.pte"
With quantization, the model loads, but it produces wrong results as shown below. What should I do?
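(For anyone reproducing this: a minimal sketch of how such a .pte is typically run on device, assuming the ExecuTorch llama_main runner has been cross-compiled for Android with the QNN backend; the device paths and the prompt are placeholders, not taken from the original post.)

# Push the exported model and tokenizer, then run them with the llama_main runner.
# The QNN runtime libraries (libQnnHtp*.so etc.) also need to be on the device and
# reachable via LD_LIBRARY_PATH / ADSP_LIBRARY_PATH; that setup is omitted here.
adb push llama3_2_3B_qnn_8a8w.pte /data/local/tmp/llama/
adb push tokenizer.model /data/local/tmp/llama/
adb shell "cd /data/local/tmp/llama && ./llama_main \
  --model_path llama3_2_3B_qnn_8a8w.pte \
  --tokenizer_path tokenizer.model \
  --prompt 'Could you tell me about Facebook?' \
  --seq_len 128"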
Dear @cccclai I have the same issue.
Dear @justin-Kor If you have resolved it in the meantime, could you share how you did it?
Thank you
Dear @crinex, I haven't solved it yet.
Dear @justin-Kor Dear @cccclai
I managed to partially solve the issue. I'm not sure if this is the correct method, and it might not work for you, but I thought it could be helpful, so I'm sharing it.
I used the SpinQuant method to convert the Llama-3.2-1B-Instruct model to the QNN backend. The SpinQuant rotation matrix (R.bin) was created for the Llama-3.2-1B model using w-bit 16 and a-bit 16.
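(For completeness, a sketch of how the rotation matrix can be produced with the SpinQuant repository, assuming the same ExecuTorch rotation script that is used later in this thread; whether the w16/a16 bit-widths are passed as extra arguments or configured inside the script depends on the SpinQuant repo version, and the model id is a placeholder.)

# Produce R.bin with the SpinQuant repo's ExecuTorch rotation script.
# Bit-width settings (w16/a16 here) may be extra arguments or edited inside the
# script, depending on the SpinQuant version you have checked out.
bash scripts/31_optimize_rotation_executorch.sh meta-llama/Llama-3.2-1B-Instruct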
Then, I generated the .pte file using the following command:
python -m examples.models.llama.export_llama \
  -t <path_to_tokenizer.model> \
  -p <path_to_params.json> \
  -c <path_to_checkpoint_for_Meta-Llama-3-8B-Instruct> \
  --use_kv_cache \
  --qnn \
  --pt2e_quantize qnn_16a16w \
  --disable_dynamic_shape \
  --num_sharding 8 \
  --calibration_tasks wikitext \
  --calibration_limit 1 \
  --calibration_seq_length 128 \
  --optimized_rotation_path <path_to_optimized_matrix> \
  --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
After creating the .pte file, I tested it on the device, and it generated reasonable sentences. It seems that at low bit-widths, the QnnQuantizer with static quantization leads to a significant accuracy loss.
Dear @cccclai Dear @crinex
I converted the Llama-3.2-3B-Instruct model to the QNN backend using the SpinQuant method, but the results are not good, so I'm asking here.
First, I created a SpinQuant Rotation Matrix as shown below.
bash scripts/31_optimize_rotation_executorch.sh meta-llama/Llama-3.2-3B-Instruct
Then I exported the model to a .pte file using the generated R.bin.
python -m examples.models.llama.export_llama \
  -t "/root/devandroid/Llama3.2-3B/Llama-3.2-3B-Instruct/original/tokenizer.model" \
  -p "/root/devandroid/Llama3.2-3B/Llama-3.2-3B-Instruct/original/params.json" \
  -c "/root/devandroid/Llama3.2-3B/Llama-3.2-3B-Instruct/original/consolidated.00.pth" \
  -kv \
  --qnn \
  --pt2e_quantize qnn_16a4w \
  --disable_dynamic_shape \
  --num_sharding 4 \
  --output_name "llama3_2_qnn_spin.pte" \
  --calibration_tasks wikitext \
  --calibration_limit 1 \
  --calibration_seq_length 128 \
  --optimized_rotation_path "/root/devandroid/R.bin" \
  --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
When I check the output, if I ask in Korean I get a strange answer in English only, and the prompt is echoed in the answer.
Are the steps above correct for the conversion? And did the same thing happen to you?
Dear @justin-Kor
As far as I know, the original Llama 3.2 3B Instruct model does not support Korean.
I think it's normal for the answers to be strange if you don't train or fine-tune it for Korean.
You can simply test the original model on Hugging Face (using the Inference API, without quantization) as a reference: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
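(A quick way to run that check from the command line: a sketch that queries the unquantized model through the Hugging Face Inference API, assuming the model is served there and you have a token with access; the Korean prompt simply mirrors the Facebook question used elsewhere in this thread.)

# Ask the unquantized Llama-3.2-3B-Instruct a Korean question via the HF Inference API.
# "페이스북에 대해 알려줘." means "Tell me about Facebook."
curl -s https://api-inference.huggingface.co/models/meta-llama/Llama-3.2-3B-Instruct \
  -H "Authorization: Bearer ${HF_TOKEN:?}" \
  -H "Content-Type: application/json" \
  -d '{"inputs": "페이스북에 대해 알려줘.", "parameters": {"max_new_tokens": 128}}'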
Dear @HSANGLEE
Thank you for your answer.
However, when I tested by converting the quantized models distributed by Meta to ExecuTorch models, they work normally:
https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-SpinQuant_INT4_EO8
https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-QLORA_INT4_EO8
So I want to check whether I'm doing the SpinQuant step wrong.
Dear @justin-Kor
Did you succeed in converting the model to QNN using Meta's SpinQuant INT4? Could you tell me which command you used? When I tried, the conversion failed due to a QNN operation error. It's tough...
Thank you
Dear @crinex
The QNN conversion also failed for me due to an error. (I expect the guide to be released soon. ㅜㅡ) The result mentioned above is for the XNNPACK target.
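(For reference, the XNNPACK conversion of Meta's pre-quantized SpinQuant checkpoint is the path described in the ExecuTorch Llama README; the sketch below is reconstructed from my reading of that README, so the pre-quantization flags (--preq_*, --use_spin_quant) should be double-checked against the current docs, and the paths are placeholders.)

# Export Meta's Llama-3.2-3B-Instruct-SpinQuant_INT4_EO8 checkpoint to an XNNPACK .pte.
# Flag names for the pre-quantized path are taken from the ExecuTorch Llama README as I
# recall it and should be verified; checkpoint/params paths are placeholders.
python -m examples.models.llama.export_llama \
  -c "${SPINQUANT_CHECKPOINT:?}" \
  -p "${SPINQUANT_PARAMS:?}" \
  -kv \
  -d fp32 \
  --use_sdpa_with_kv_cache \
  -X \
  --xnnpack-extended-ops \
  --preq_mode 8da4w_output_8da8w \
  --preq_group_size 32 \
  --preq_embedding_quantize 8,0 \
  --use_spin_quant native \
  --max_seq_length 2048 \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --output_name "llama3_2_3b_spinquant_xnnpack.pte"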
@justin-Kor
Oh, I understand.
But it would be better to ask the community or the author about the issue you referenced.
Dear @justin-Kor
If the Korean result you showed is from the QLoRA model (https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-QLORA_INT4_EO8), that's reasonable.
But if you're using the SpinQuant model (https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-SpinQuant_INT4_EO8), it does not make sense (SpinQuant is just PTQ).
I hope this helps your understanding.
Dear @HSANGLEE
It's strange to me, but both the SpinQuant and QLoRA models work fine for Korean questions. Also, besides the quantized models distributed by Meta, a model converted without quantization following the ExecuTorch guide also handled Korean questions normally.
Dear @justin-Kor
Sorry for commenting out of the blue. I'm curious if the generation works well when using QNN. Also, how is the performance of Korean text generation? Could you share your thoughts?
Dear @j0h0k0i0m,
If I run a model quantized with QNN via SpinQuant, the English answers look fine, but the Korean answers are strange.
Please note that there is a similar issue; I'll share it with you: https://github.com/pytorch/executorch/issues/6584
🐛 Describe the bug
Currently I'm trying to test the Llama 3.2 3B Instruct model as you guided, but I faced some issues during .pte generation for the Llama 3.2 3B Instruct model with QNN on the device side.
I tried just this command, as you guided:
python -m examples.models.llama2.export_llama \
  -c "${LLAMA_CHECKPOINT:?}" \
  -p "${LLAMA_PARAMS:?}" \
  -kv \
  --disable_dynamic_shape \
  --qnn \
  -d fp32 \
  --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001], "get_n_bos": 0, "get_n_eos": 0}' \
  --output_name="llama3_2_3b_qnn_Instruct_noquan.pte"
The error logs are below.
[ERROR] [Qnn ExecuTorch]: graph_prepare.cc:6004:ERROR:couldn't insert overall len
[ERROR] [Qnn ExecuTorch]: QnnDsp Graph executorch serialization failed
[ERROR] [Qnn ExecuTorch]: QnnDsp Failed to serialize graph executorch
[ERROR] [Qnn ExecuTorch]: QnnDsp Context binary serialization failed
[ERROR] [Qnn ExecuTorch]: QnnDsp Get context blob failed.
[ERROR] [Qnn ExecuTorch]: QnnDsp Failed to get serialized binary
[ERROR] [Qnn ExecuTorch]: QnnDsp Failed to get context binary with err 0x138f
[ERROR] [Qnn ExecuTorch]: Can't get graph binary to be saved to cache. Error 5007
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/executorch/examples/models/llama2/export_llama.py", line 30, in <module>
    main()  # pragma: no cover
  File "/root/executorch/examples/models/llama2/export_llama.py", line 26, in main
    export_llama(modelname, args)
  File "/root/executorch/examples/models/llama2/export_llama_lib.py", line 411, in export_llama
    builder = _export_llama(modelname, args)
  File "/root/executorch/examples/models/llama2/export_llama_lib.py", line 596, in _export_llama
    builder = builder_exported_to_edge.to_backend(partitioners)
  File "/root/executorch/extension/llm/export/builder.py", line 363, in to_backend
    self.edge_manager = self.edge_manager.to_backend(partitioner)
  File "/root/executorch/exir/program/_program.py", line 1291, in to_backend
    new_edge_programs[name] = to_backend(program, partitioner)
  File "/usr/lib/python3.10/functools.py", line 889, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/root/executorch/exir/backend/backend_api.py", line 396, in _
    tagged_graph_module = _partition_and_lower(
  File "/root/executorch/exir/backend/backend_api.py", line 319, in _partition_and_lower
    partitioned_module = _partition_and_lower_one_graph_module(
  File "/root/executorch/exir/backend/backend_api.py", line 249, in _partition_and_lower_one_graph_module
    lowered_submodule = to_backend(
  File "/usr/lib/python3.10/functools.py", line 889, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/root/executorch/exir/backend/backend_api.py", line 113, in _
    preprocess_result: PreprocessResult = cls.preprocess(
  File "/root/executorch/backends/qualcomm/qnn_preprocess.py", line 111, in preprocess
    assert len(qnn_context_binary) != 0, "Failed to generate Qnn context binary."
AssertionError: Failed to generate Qnn context binary.
Versions
-