neuralmagic / sparseml

Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
Apache License 2.0

Remove QuantizeLinear/DequantizeLinear of ONNX model #2320

Closed · hoangtv2000 closed this 4 months ago

hoangtv2000 commented 5 months ago

Hi, I trained a YOLOv8 model and exported it to ONNX format using the quantization recipe below. I set weight_bits=8 and activation_bits=8 so that the full inference flow of the quantized model stays in fixed-point uint8. However, the QuantizeLinear and DequantizeLinear nodes still exist and they convert the activations back to floating-point tensors. I also checked the final_recipe.yaml and saw that my activation_bits setting was disabled. Is there any way to get rid of these nodes while preserving model performance, or another recipe setting that makes the model run fully integer inference? Thanks.

```yaml
version: 1.1.0

# General variables
num_epochs: 20
init_lr: 1.e-3
final_lr: 1.e-6
lr_func: cyclic

# Quantization variables
qat_start_epoch: 1
observer_freeze_epoch: 3
bn_freeze_epoch: 3

training_modifiers:
  - !EpochRangeModifier
    start_epoch: 1
    end_epoch: eval(num_epochs)

  - !LearningRateFunctionModifier
    start_epoch: eval(qat_start_epoch)
    end_epoch: eval(num_epochs)
    lr_func: cosine
    init_lr: eval(init_lr)
    final_lr: eval(final_lr)

quantization_modifiers:
  - !QuantizationModifier
    start_epoch: eval(qat_start_epoch)
    disable_quantization_observer_epoch: eval(observer_freeze_epoch)
    freeze_bn_stats_epoch: eval(bn_freeze_epoch)
    # ignore: ['Upsample', 'Concat']

    # tensorrt: False
    quantize_linear_activations: True
    quantize_conv_activations: True
    # quantize_embedding_activations: True
    # quantize_embeddings: True
    # reduce_range: True
    # exclude_module_types: ['Concat', 'Upsample']
    weight_bits: 8
    activation_bits: 8
    model_fuse_fn_name: conv_bn_relus
    # exclude_batchnorm: True
```
bfineran commented 5 months ago

Hi @hoangtv2000, how are you exporting the model? convert_qat should be set to True so that the Q/DQs are folded into fully quantized layers.
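
For reference, a minimal sketch of what that could look like through sparseml's generic PyTorch export path (ModuleExporter); the model variable, output directory, and input shape below are placeholders, and the YOLOv8 integration may expose the same flag through its own export entrypoint:

```python
import torch
from sparseml.pytorch.utils import ModuleExporter

model = ...  # the QAT-trained torch.nn.Module (placeholder)

exporter = ModuleExporter(model, output_dir="onnx-export")
exporter.export_onnx(
    sample_batch=torch.randn(1, 3, 640, 640),  # assumed YOLOv8 input shape
    name="model.onnx",
    convert_qat=True,  # fold Q/DQ pairs into fully quantized ops where supported
)
```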

hoangtv2000 commented 5 months ago

Thank you for responding @bfineran. I set the convert_qat variable to True and the conversion seems to fold the Q/DQs into Convolution and Linear, but they cannot be folded into some operators such as Add and Concat. Can you advise how to fold the Q/DQ operators around Add, Concat, etc., or how to skip quantizing these operators?

My team is developing an algorithm for an AI chip that allows no floating-point storage or computation: floating-point values such as scales are decomposed into an integer multiplier and a shift, and the floating-point multiplication is simulated by a multiplication in a 32-bit fixed-point range followed by a right shift. The theory behind this method is described in Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.
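
For illustration only, here is a small sketch of the multiplier-plus-shift technique described above (my own example in the spirit of the Jacob et al. paper, not sparseml or hardware code): a floating-point rescale factor is approximated by a 32-bit integer multiplier and a right shift, and the float multiply is replaced by an integer multiply plus a rounding shift.

```python
import math

def quantize_multiplier(real_multiplier: float, bits: int = 31):
    """Approximate real_multiplier as multiplier * 2**(-shift) with an integer multiplier."""
    assert 0.0 < real_multiplier < 1.0
    mant, exp = math.frexp(real_multiplier)   # real_multiplier = mant * 2**exp, mant in [0.5, 1)
    multiplier = round(mant * (1 << bits))    # fixed-point mantissa, roughly [2**(bits-1), 2**bits]
    shift = bits - exp                        # total right shift that undoes the scaling
    if multiplier == (1 << bits):             # rounding overflowed: renormalize
        multiplier //= 2
        shift -= 1
    return multiplier, shift

def rescale(acc: int, multiplier: int, shift: int) -> int:
    """Integer-only replacement for round(acc * real_multiplier)."""
    rounding = 1 << (shift - 1)
    # On hardware this would be a widened (e.g. 64-bit) multiply followed by a right shift.
    return (acc * multiplier + rounding) >> shift

# Example: accumulator 1000 with real multiplier 0.0123 -> roughly round(12.3) = 12
m, s = quantize_multiplier(0.0123)
print(rescale(1000, m, s))
```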

Below are Q/DQs we want to fold or eliminate if needed. Thank you for your help.

(attached screenshot: ONNX graph showing the Q/DQ nodes around the Add/Concat operators)

hoangtv2000 commented 5 months ago

Okay, I see that my problem can be resolved by changing Concat and Add to QLinearConcat and QLinearAdd respectively, but sparseml does not support these operators yet.

bfineran commented 5 months ago

Sounds like an interesting project! Yes, that is an approach that would work: you could write the conversions on your own (or extend the ONNX transforms we have).
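
As an illustration of that first suggestion, here is a rough standalone sketch using the plain onnx API (not sparseml's transform classes) that rewrites a DequantizeLinear -> Add -> QuantizeLinear pattern into a QLinearAdd node. Note the assumptions: QLinearAdd/QLinearConcat live in the com.microsoft domain, so the target runtime or toolchain must understand them; zero-point inputs are assumed to be explicit; and a real pass would also keep the node list topologically sorted and handle DQ outputs with multiple consumers.

```python
import onnx
from onnx import helper

def fold_add_to_qlinearadd(model: onnx.ModelProto) -> onnx.ModelProto:
    graph = model.graph
    nodes = list(graph.node)
    by_output = {out: n for n in nodes for out in n.output}
    by_input = {}
    for n in nodes:
        for i in n.input:
            by_input.setdefault(i, []).append(n)

    to_remove, to_add = [], []
    for add in [n for n in nodes if n.op_type == "Add"]:
        # both Add inputs must come from DequantizeLinear nodes
        dqs = [by_output.get(i) for i in add.input]
        if any(d is None or d.op_type != "DequantizeLinear" for d in dqs):
            continue
        # the Add output must feed exactly one QuantizeLinear node
        consumers = by_input.get(add.output[0], [])
        if len(consumers) != 1 or consumers[0].op_type != "QuantizeLinear":
            continue
        q = consumers[0]
        if any(len(d.input) != 3 for d in dqs) or len(q.input) != 3:
            continue  # assume explicit zero-point inputs for simplicity
        a, b = dqs
        qlinear_add = helper.make_node(
            "QLinearAdd",
            inputs=[a.input[0], a.input[1], a.input[2],
                    b.input[0], b.input[1], b.input[2],
                    q.input[1], q.input[2]],
            outputs=[q.output[0]],
            name=add.name + "_qlinear",
            domain="com.microsoft",  # QLinearAdd is an ONNX Runtime contrib op
        )
        to_remove.extend([a, b, add, q])
        to_add.append(qlinear_add)

    for n in to_remove:
        graph.node.remove(n)
    graph.node.extend(to_add)  # a real pass would re-sort nodes topologically here
    if not any(op.domain == "com.microsoft" for op in model.opset_import):
        model.opset_import.append(helper.make_opsetid("com.microsoft", 1))
    return model
```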

Alternatively, you could just ingest the Q/DQs in the graph but skip the DQs when processing (ONNX Runtime does something similar for its quantized execution of models exported from PyTorch).

jeanniefinks commented 4 months ago

Hi @hoangtv2000, as there have been no further comments, I will go ahead and close out this issue. Thank you for reaching out! If you have not already, be sure to "star" our GitHub repos.

Best, Jeannie / Neural Magic