quic / aimet

AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
https://quic.github.io/aimet-pages/index.html

Asking for a guide on quantization process utilizing SNPE after applying AIMET QAT/PTQ #3438

Open chewry opened 2 weeks ago

chewry commented 2 weeks ago

Hello authors,

Thank you for your excellent work.

I've tried utilizing AIMET to resolve a severe performance degradation issue caused by quantization while using the SNPE library. However, I've encountered the same problem with AIMET. I would like to seek advice or opinions on this matter.

Here is the previous SNPE-only workflow.

## SNPE workflow
1. torch.onnx.export(full_precision_torch_model)
2. snpe-onnx-to-dlc --input_network full_precision_onnx_model
3. snpe-dlc-quantize --input_dlc full_precision_dlc_model

A trained Torch model (full precision) was converted to ONNX (using torch.onnx.export), converted to DLC (using snpe-onnx-to-dlc), and then quantized (using snpe-dlc-quantize). This works well for simple models (e.g., MobileNet) but fails for deeper models. We have confirmed through experiments that keeping the model's activations at 16-bit also preserves its performance. However, to achieve the speed of a w8a8 model, I decided to apply AIMET; the resulting AIMET -> SNPE workflow is below.
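For reference, step 1 of the plain SNPE workflow is an ordinary ONNX export; a minimal sketch (the model, input shape, and file name here are placeholders, not our actual network):

```python
import torch
import torchvision

# Placeholder trained model and input; substitute your own full-precision network.
model = torchvision.models.mobilenet_v2().eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Step 1 of the SNPE workflow: export the full-precision model to ONNX
# so it can be fed to snpe-onnx-to-dlc.
torch.onnx.export(
    model,
    dummy_input,
    "full_precision_model.onnx",
    input_names=["input_01"],
    output_names=["output"],
)
```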

## AIMET -> SNPE workflow
1. (QAT) prepare_model -> compute_encodings -> train -> QuantizationSimModel.export(AIMET_sim_model)
   or (PTQ, CLE) prepare_model -> equalize_model -> compute_encodings -> QuantizationSimModel.export(AIMET_sim_model)
2. snpe-onnx-to-dlc --input_network AIMET_onnx_model --quantization_overrides AIMET.encodings
3. snpe-dlc-quantize --input_dlc AIMET_dlc_model

I tried to make the network's ranges more quantization-friendly by applying CLE (PTQ), or to adapt the model and its parameters to quantization by applying QAT. The AIMET QuantizationSimModel export produced ONNX and encodings files, which were combined by snpe-onnx-to-dlc to create a DLC, and the quantized DLC was then created with snpe-dlc-quantize. However, even with AIMET applied in this way, the same problem remained: performance is maintained before quantization but drops significantly after quantization.
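To make the CLE (PTQ) branch concrete, here is a simplified sketch of the calls in step 1, based on the AIMET examples rather than my full script; the model, input shape, and calibration callback are placeholders:

```python
import torch
import torchvision
from aimet_common.defs import QuantScheme
from aimet_torch.model_preparer import prepare_model
from aimet_torch.cross_layer_equalization import equalize_model
from aimet_torch.quantsim import QuantizationSimModel

# Placeholder model and input; substitute the real trained network and data.
model = torchvision.models.mobilenet_v2().eval()
dummy_input = torch.randn(1, 3, 224, 224)

def pass_calibration_data(sim_model, use_cuda):
    # Placeholder calibration: run a few representative batches through the model.
    with torch.no_grad():
        sim_model(dummy_input)

model = prepare_model(model)                            # graph rewrites for quantization
equalize_model(model, input_shapes=(1, 3, 224, 224))    # CLE (modifies the model in place)

sim = QuantizationSimModel(
    model=model,
    quant_scheme=QuantScheme.post_training_tf_enhanced,
    dummy_input=dummy_input,
    default_output_bw=8,
    default_param_bw=8,
)
sim.compute_encodings(
    forward_pass_callback=pass_calibration_data,
    forward_pass_callback_args=False,                   # use_cuda flag in the AIMET examples
)
sim.export(path=".", filename_prefix="model_cle", dummy_input=dummy_input)
```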

While analyzing the cause of this performance degradation, I could not find any official guide for converting to DLC after QuantizationSimModel.export, so I am unsure whether my SNPE conversion process is correct. Any guidance or opinions on going from the ONNX model (or other AIMET byproducts) to a quantized DLC would be greatly appreciated.

Also, apart from the SNPE step, I followed the same process as in the AIMET examples. If there is anything else you would like to point out, I welcome any feedback.

quic-mangal commented 2 weeks ago

@quic-akinlawo, can you help respond to this?

NikilXYZ commented 5 days ago

interested in this also

quic-akinlawo commented 5 days ago

Hi @chewry, can you share more details about the quantization options you used in AIMET? And also, what was the performance of the simulated model in AIMET (before you used the snpe converter model)?

For your reference, this is the guide to the snpe-dlc-quantizer: https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-2/tools.html?product=1601111740010412#snpe-dlc-quantize

chewry commented 3 days ago

> Hi @chewry, can you share more details about the quantization options you used in AIMET? And also, what was the performance of the simulated model in AIMET (before you used the snpe converter model)?
>
> For your reference, this is the guide to the snpe-dlc-quantizer: https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-2/tools.html?product=1601111740010412#snpe-dlc-quantize

Hello, @quic-akinlawo. Thank you for your attention.

  1. I briefly include the AIMET quantization code for our model below.
  2. When I tested the exported ONNX model (after QAT), it showed almost the same performance as before QAT.

I was advised to add the --override_params option to the snpe-dlc-quantize command. It seems to reduce the degradation, but a quantization artifact still remains, and I cannot tell whether the artifact comes from quantization itself or from a mistake in my commands. An official guide would be very helpful.

from aimet_common.defs import QuantScheme
from aimet_torch.model_preparer import prepare_model
from aimet_torch.quantsim import QuantizationSimModel

# Rewrite the model graph so it is compatible with AIMET quantization ops
model = prepare_model(model)

# Wrap the model in a quantization simulator (w8a8, range-learning QAT scheme)
sim = QuantizationSimModel(
    model=model,
    quant_scheme=QuantScheme.training_range_learning_with_tf_init,
    dummy_input=dummy_input,
    default_output_bw=8,
    default_param_bw=8,
)

# Calibrate the initial quantization encodings with representative data
sim.compute_encodings(
    forward_pass_callback=pass_calibration_data,
    forward_pass_callback_args=use_cuda,
)

# QAT: fine-tune the simulated model with our regular training loop
sim.model = trainer.train(sim.model)

# Export the ONNX model and the .encodings file for the SNPE converter
sim.export(
    path=f"./{args.save_folder}/",
    filename_prefix=f"{args.model_name}",
    dummy_input=dummy_input,
    onnx_export_args={
        "input_names": ["input_01"],
        "output_names": ["output"],
    },
    # use_embedded_encodings=True,
)
quic-mtuttle commented 1 day ago

Hi @chewry, have you evaluated the accuracy of the sim.model object in pytorch? That should give you a good idea of the performance degradation due to the quantization. The exported onnx model itself does not contain any quantization nodes (without use_embedded_encodings=True at least), which is likely why you see very close to FP performance here.