baotonghe opened this issue 1 week ago
@baotonghe You should check the other QNN issues; most of them hit the same error as yours. They (the executorch team) have not suggested any countermeasures or alternative methods.
Hi, thank you for trying out the Llama model on QNN. Since the command you ran didn't include the calibration process, the output will likely be very off. We're still working on a quantized 1B model for QNN.
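(For reference, the QNN Llama instructions in the executorch repo pass calibration flags to `export_llama` during PT2E quantization. A minimal sketch, assuming the `--calibration_*` flag names from the llama README; the task, limit, and prompt values are illustrative, not a verified recipe:)

```bash
# Same export as in the issue below, plus PT2E calibration flags:
#   --calibration_tasks / --calibration_limit choose the eval task and sample count,
#   --calibration_seq_length bounds each calibration sample,
#   --calibration_data seeds the calibration prompt.
python -m examples.models.llama.export_llama \
  --checkpoint "${MODEL_DIR}/consolidated.00.pth" \
  -p "${MODEL_DIR}/params.json" \
  -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 \
  --calibration_tasks wikitext --calibration_limit 1 \
  --calibration_seq_length 128 --calibration_data "Once upon a time" \
  --output_name "llama3_2_ptqqnn.pte"
```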
@cccclai Are you suggesting that the recommended calibration method is SpinQuant?
@cccclai Thank you for your response; looking forward to good news.
@crinex Thank you for the explanation. I've also read through many issues, and everyone seems to hit similar problems. Let's wait for the official updates and response.
Working case
When I follow the doc at https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#enablement, I can export the Llama3.2-1B-Instruct:int4-spinquant-eo8 model to an XNNPACK backend .pte successfully, and it works fine on the CPU.
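(For comparison, a minimal sketch of an XNNPACK export, using only flags that also appear in the QNN command below plus `-X` to select the XNNPACK backend; the SpinQuant-specific `--preq_*` flags from the README are omitted, so this is not the exact command used:)

```bash
# Hypothetical minimal XNNPACK export; the README's SpinQuant-specific
# flags are left out, so this is a sketch rather than a working recipe
# for the pre-quantized checkpoint.
python -m examples.models.llama.export_llama \
  --checkpoint "${MODEL_DIR}/consolidated.00.pth" \
  -p "${MODEL_DIR}/params.json" \
  -kv -X -d fp32 \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --output_name "llama3_2_xnnpack.pte"
```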
Failing case
But following the same doc (https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md), when I export Llama3.2-1B-Instruct to the QNN backend, I get the output .pte file, but when I run it on an Android device it does not work correctly.
I export the .pte file like this:
```bash
python -m examples.models.llama.export_llama \
  --checkpoint "${MODEL_DIR}/consolidated.00.pth" \
  -p "${MODEL_DIR}/params.json" \
  -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --soc_model SM8550 \
  --output_name="llama3_2_ptqqnn.pte"
```
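(On the device, the run is presumably invoked along the lines of the README's `llama_main` runner; the binary location and tokenizer path below are assumptions, not the exact commands used:)

```bash
# Push the exported model and tokenizer, then run the llama_main runner.
# /data/local/tmp and the tokenizer filename are assumptions.
adb push llama3_2_ptqqnn.pte /data/local/tmp/
adb push tokenizer.model /data/local/tmp/
adb shell "cd /data/local/tmp && ./llama_main \
  --model_path llama3_2_ptqqnn.pte \
  --tokenizer_path tokenizer.model \
  --prompt \"Once upon a time\" \
  --seq_len 128"
```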
This is part of the output from the export:
```
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_permute_copy_default_979, aten.permute_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_squeeze_copy_dims_175, aten.squeeze_copy.dims
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_add_tensor_79, aten.add.Tensor
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_select_copy_int_512, aten.select_copy.int
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_rms_norm_default_32, aten.rms_norm.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_view_copy_default_288, aten.view_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_permute_copy_default_980, aten.permute_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_convolution_default_112, aten.convolution.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_permute_copy_default_981, aten.permute_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_view_copy_default_289, aten.view_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: quantized_decomposed_dequantize_per_tensor_tensor, quantized_decomposed.dequantize_per_tensor.tensor
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
/home/hebaotong/AI/Executorch/executorch_new/executorch/exir/emit/_emitter.py:1512: UserWarning: Mutation on a buffer in the model is detected. ExecuTorch assumes buffers that are mutated in the graph have a meaningless initial state, only the shape and dtype will be serialized.
  warnings.warn(
INFO:root:Required memory for activation in bytes: [0, 17552384]
modelname: llama3_2_ptqqnn
output_file: llama3_2_ptqqnn.pte
INFO:root:Saved exported program to llama3_2_ptqqnn.pte
```
[Screenshot of run status]