baotonghe opened this issue 1 week ago
@baotonghe You should check the other QNN issues; most of them hit the same error as yours. They (the executorch team) have not suggested any countermeasures or alternative methods.
Hi, thank you for trying out the Llama model on QNN. Since the command you ran didn't include the calibration process, the output will likely be very off. We're still working on a quantized 1B model for QNN.
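(For reference, the QNN Llama instructions in the executorch repo pass calibration flags to `export_llama` during PT2E quantization. A minimal sketch, assuming the `--calibration_*` flag names from the llama README; the task, limit, and prompt values are illustrative, not a verified recipe:)

```bash
# Same export as in the issue below, plus PT2E calibration flags:
#   --calibration_tasks / --calibration_limit choose the eval task and sample count,
#   --calibration_seq_length bounds each calibration sample,
#   --calibration_data seeds the calibration prompt.
python -m examples.models.llama.export_llama \
  --checkpoint "${MODEL_DIR}/consolidated.00.pth" \
  -p "${MODEL_DIR}/params.json" \
  -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 \
  --calibration_tasks wikitext --calibration_limit 1 \
  --calibration_seq_length 128 --calibration_data "Once upon a time" \
  --output_name "llama3_2_ptqqnn.pte"
```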
@cccclai Are you suggesting that the recommended calibration method is SpinQuant?
@cccclai Thank you for your response; looking forward to good news.
@crinex Thank you for the explanation. I've also read through many issues, and everyone seems to hit similar problems. Let's wait for the official updates and response.
Working case
When I follow the doc at https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#enablement, I can export the Llama3.2-1B-Instruct:int4-spinquant-eo8 model to an XNNPACK backend .pte successfully, and it works fine on the CPU.
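(For comparison, a minimal sketch of an XNNPACK export, using only flags that also appear in the QNN command below plus `-X` to select the XNNPACK backend; the SpinQuant-specific `--preq_*` flags from the README are omitted, so this is not the exact command used:)

```bash
# Hypothetical minimal XNNPACK export; the README's SpinQuant-specific
# flags are left out, so this is a sketch rather than a working recipe
# for the pre-quantized checkpoint.
python -m examples.models.llama.export_llama \
  --checkpoint "${MODEL_DIR}/consolidated.00.pth" \
  -p "${MODEL_DIR}/params.json" \
  -kv -X -d fp32 \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --output_name "llama3_2_xnnpack.pte"
```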
Failing case
But following the same doc (https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md), when I export Llama3.2-1B-Instruct to the QNN backend, I get the output .pte file, but when I run it on an Android device it does not work correctly.
I export the .pte file like this:
```bash
python -m examples.models.llama.export_llama \
  --checkpoint "${MODEL_DIR}/consolidated.00.pth" \
  -p "${MODEL_DIR}/params.json" \
  -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --soc_model SM8550 \
  --output_name="llama3_2_ptqqnn.pte"
```
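(On the device, the run is presumably invoked along the lines of the README's `llama_main` runner; the binary location and tokenizer path below are assumptions, not the exact commands used:)

```bash
# Push the exported model and tokenizer, then run the llama_main runner.
# /data/local/tmp and the tokenizer filename are assumptions.
adb push llama3_2_ptqqnn.pte /data/local/tmp/
adb push tokenizer.model /data/local/tmp/
adb shell "cd /data/local/tmp && ./llama_main \
  --model_path llama3_2_ptqqnn.pte \
  --tokenizer_path tokenizer.model \
  --prompt \"Once upon a time\" \
  --seq_len 128"
```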
This is part of the output from the export:
```
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_permute_copy_default_979, aten.permute_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_squeeze_copy_dims_175, aten.squeeze_copy.dims
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_add_tensor_79, aten.add.Tensor
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_select_copy_int_512, aten.select_copy.int
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_rms_norm_default_32, aten.rms_norm.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_view_copy_default_288, aten.view_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_permute_copy_default_980, aten.permute_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_convolution_default_112, aten.convolution.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_permute_copy_default_981, aten.permute_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_view_copy_default_289, aten.view_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: quantized_decomposed_dequantize_per_tensor_tensor, quantized_decomposed.dequantize_per_tensor.tensor
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
/home/hebaotong/AI/Executorch/executorch_new/executorch/exir/emit/_emitter.py:1512: UserWarning: Mutation on a buffer in the model is detected. ExecuTorch assumes buffers that are mutated in the graph have a meaningless initial state, only the shape and dtype will be serialized.
  warnings.warn(
INFO:root:Required memory for activation in bytes: [0, 17552384]
modelname: llama3_2_ptqqnn
output_file: llama3_2_ptqqnn.pte
INFO:root:Saved exported program to llama3_2_ptqqnn.pte
```
[Screenshot of run status]