chewry opened this issue 2 weeks ago
@quic-akinlawo, can you help respond to this?
interested in this also
Hi @chewry, can you share more details about the quantization options you used in AIMET? And also, what was the performance of the simulated model in AIMET (before you used the SNPE converter)?
For your reference, this is the guide to snpe-dlc-quantize: https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-2/tools.html?product=1601111740010412#snpe-dlc-quantize
Hello, @quic-akinlawo. Thank you for your attention.
I was advised to add the --override_params option to the snpe-dlc-quantize command. It seems to reduce the degradation, but quantization artifacts still remain, and I cannot tell whether they come from the quantization itself or from a mistake in my command. If there is an official guide for this, it would be very helpful.
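For context, the quantize step I'm running looks roughly like this (the file names are placeholders, not my exact paths):

snpe-dlc-quantize --input_dlc model.dlc \
                  --input_list input_list.txt \
                  --override_params \
                  --output_dlc model_quantized.dlc

The AIMET side of my setup is below.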
# AIMET QAT setup (aimet_torch 1.x API)
from aimet_common.defs import QuantScheme
from aimet_torch.model_preparer import prepare_model
from aimet_torch.quantsim import QuantizationSimModel

# Prepare the model for quantization simulation
model = prepare_model(model)

# Create the w8a8 quantization simulation with the range-learning QAT scheme
sim = QuantizationSimModel(
    model=model,
    quant_scheme=QuantScheme.training_range_learning_with_tf_init,
    dummy_input=dummy_input,
    default_output_bw=8,
    default_param_bw=8,
)

# Calibrate quantizer encodings with forward passes over calibration data
sim.compute_encodings(
    forward_pass_callback=pass_calibration_data,
    forward_pass_callback_args=use_cuda,
)

# Fine-tune (QAT) the simulated model
sim.model = trainer.train(sim.model)

# Export the ONNX model and .encodings file for the SNPE conversion step
sim.export(
    path=f"./{args.save_folder}/",
    filename_prefix=f"{args.model_name}",
    dummy_input=dummy_input,
    onnx_export_args={
        "input_names": ["input_01"],
        "output_names": ["output"],
    },
    # use_embedded_encodings=True,
)
Hi @chewry, have you evaluated the accuracy of the sim.model object in PyTorch? That should give you a good idea of the performance degradation due to the quantization. The exported ONNX model itself does not contain any quantization nodes (without use_embedded_encodings=True, at least), which is likely why you see very close to FP performance here.
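Something along these lines would work as a quick check (a rough sketch; evaluate_accuracy and val_loader stand in for your own evaluation pipeline):

import torch

# Compare the original FP32 model against the quantization-simulated model in PyTorch.
# `evaluate_accuracy` and `val_loader` are placeholders for your own evaluation code.
sim.model.eval()
with torch.no_grad():
    fp32_score = evaluate_accuracy(model, val_loader)     # original full-precision model
    sim_score = evaluate_accuracy(sim.model, val_loader)  # QuantizationSimModel
print(f"FP32 accuracy: {fp32_score:.4f}, QuantSim accuracy: {sim_score:.4f}")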
Hello authors,
Thank you for your excellent work.
I've been trying to use AIMET to resolve a severe performance degradation caused by quantization with the SNPE library, but I've run into the same problem with AIMET as well. I would like to seek advice or opinions on this matter.
Here is the previous SNPE-only workflow:
A trained Torch model (full precision) was converted to ONNX (using torch.onnx.export), converted to DLC (using snpe-onnx-to-dlc), and then quantized (using snpe-dlc-quantize). This works well for simple models (e.g. MobileNet) but fails for deeper models. We have confirmed through experiments that keeping the model's activations at 16 bits preserves its performance; however, to achieve the speed of a w8a8 model, I decided to apply AIMET. Here is the AIMET -> SNPE workflow:
I tried to redistribute the network's activations by applying CLE (PTQ), or to adapt the model and its parameters for quantization by applying QAT. Through AIMET's QuantizationSimModel.export, the ONNX and encodings files were generated; these were combined in snpe-onnx-to-dlc to create a DLC, and a quantized DLC was then created with snpe-dlc-quantize. However, even with AIMET applied in this way, the same problem remained: performance was maintained before quantization but dropped significantly after quantization.
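For the CLE path, the step was applied roughly as sketched below (a minimal sketch against the aimet_torch 1.x API; the input shape is only a placeholder, not my model's actual input):

from aimet_torch.cross_layer_equalization import equalize_model
from aimet_torch.model_preparer import prepare_model

# Cross-Layer Equalization (PTQ): folds batch norms and equalizes weight ranges in place.
# The input shape below is a placeholder.
model = prepare_model(model)
equalize_model(model, input_shapes=(1, 3, 224, 224))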
While analyzing the cause of this performance degradation, I couldn't find any official guide for converting to DLC after QuantizationSimModel.export, so I am not sure whether my SNPE conversion process is actually correct. Any guidance or opinions on going from the ONNX (or other AIMET byproducts) to a quantized DLC would be greatly appreciated.
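Concretely, the conversion chain I am attempting looks roughly like this (file names are placeholders; please correct me if any step or flag is wrong):

# Combine the exported ONNX with the AIMET .encodings file at conversion time
snpe-onnx-to-dlc --input_network model.onnx \
                 --quantization_overrides model.encodings \
                 --output_path model.dlc

# Quantize the DLC while keeping the AIMET-provided parameters
snpe-dlc-quantize --input_dlc model.dlc \
                  --input_list input_list.txt \
                  --override_params \
                  --output_dlc model_quantized.dlc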
Also, apart from the SNPE step, I followed the same process as in the AIMET examples. If there is anything else you would like to point out, I welcome any feedback.