hi @hoangtv2000 how are you exporting the model? convert_qat should be set to True so that the Q/DQs will be folded into fully quantized layers.
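For reference, a minimal export sketch with convert_qat=True; the exact ModuleExporter/export_onnx argument placement may differ between sparseml versions, so treat this as an assumption rather than the canonical call:

```python
# Hedged sketch: export a QAT-trained PyTorch model with convert_qat=True so the
# Q/DQ pairs around Conv/Linear layers get folded into fully quantized ops.
# Argument placement is an assumption and may vary across sparseml versions.
import torch
from sparseml.pytorch.utils import ModuleExporter

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3))  # placeholder; use your QAT-trained model
sample_batch = torch.randn(1, 3, 640, 640)              # illustrative input shape

exporter = ModuleExporter(model, output_dir="exported")
exporter.export_onnx(sample_batch=sample_batch, convert_qat=True)
```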
Thank you for responding @bfineran,
I set the convert_qat variable to True and the conversion seems to fold the Q/DQs into Convolution and Linear layers, but the Q/DQs cannot be folded for some operators such as Add and Concat. Can you provide advice on how to fold the Q/DQs around Add, Concat,... or on how to skip quantizing these operators?
My team is developing an algorithm for an AI chip that allows no storage or computation in floating-point format: floating-point numbers such as scales are decomposed into a multiplier and a shift, and we simulate the floating-point computation with a multiplication in a 32-bit fixed-point range followed by a right shift. The theory behind this method is described in Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.
Below are the Q/DQs we want to fold or eliminate if needed. Thank you for your help.
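For context, here is a minimal sketch of the multiplier-and-shift decomposition described above (not tied to sparseml; the function names are illustrative): a float scale is approximated as multiplier / 2**shift, so rescaling an accumulator needs only an integer multiply and a right shift, as in the integer-arithmetic-only inference paper.

```python
import math

def quantize_scale(scale: float, bits: int = 32):
    """Approximate a positive scale (< 1, as is typical for quantization) as
    multiplier / 2**shift, with the multiplier kept inside `bits` bits."""
    mantissa, exponent = math.frexp(scale)        # scale = mantissa * 2**exponent, mantissa in [0.5, 1)
    multiplier = round(mantissa * (1 << (bits - 1)))
    shift = (bits - 1) - exponent
    if multiplier == (1 << (bits - 1)):           # mantissa rounded up to 1.0
        multiplier //= 2
        shift -= 1
    return multiplier, shift

def rescale(acc: int, multiplier: int, shift: int) -> int:
    """Simulate round(acc * scale) with an integer multiply and a right shift."""
    rounding = 1 << (shift - 1)
    return (acc * multiplier + rounding) >> shift

mult, sh = quantize_scale(0.0123)
print(rescale(1000, mult, sh))  # ~= round(1000 * 0.0123) = 12
```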
Okay, I see that my problem can be resolved by changing Concat and Add to QLinearConcat and QLinearAdd respectively, but sparseml doesn't support these operators yet.
Sounds like an interesting project! Yes, that is an approach that would work - you could write the conversions on your own (or extend the ONNX transforms we have).
Alternatively, you could just ingest the Q/DQs in the graph but skip the DQs when processing (onnxruntime does something similar for its quantized execution from pytorch).
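If it helps, a rough sketch of what such a conversion could look like for a single Add; note that QLinearAdd/QLinearConcat are onnxruntime contrib ops in the com.microsoft domain rather than standard ONNX ops, and all node/tensor names below are placeholders. A real transform would locate the surrounding DequantizeLinear/QuantizeLinear nodes and keep the node list topologically sorted.

```python
# Rough sketch, not a sparseml API: replace DequantizeLinear -> Add -> QuantizeLinear
# with a single QLinearAdd (onnxruntime contrib op, com.microsoft domain).
# All node/tensor names below are placeholders for whatever your graph uses.
import onnx
from onnx import helper

model = onnx.load("model_qat.onnx")
graph = model.graph

qlinear_add = helper.make_node(
    "QLinearAdd",
    inputs=[
        "A_quantized", "A_scale", "A_zero_point",  # first quantized input + its quant params
        "B_quantized", "B_scale", "B_zero_point",  # second quantized input + its quant params
        "C_scale", "C_zero_point",                 # output quant params
    ],
    outputs=["C_quantized"],
    name="qlinear_add_0",
    domain="com.microsoft",
)

# Drop the original DQ/Add/Q nodes and splice in the fused op (a real pass would
# re-insert it at the correct position to keep the node list topologically sorted).
for node in [n for n in graph.node if n.name in {"dq_A", "dq_B", "add_0", "q_C"}]:
    graph.node.remove(node)
graph.node.append(qlinear_add)

# Declare the com.microsoft opset so onnxruntime accepts the contrib op.
if not any(op.domain == "com.microsoft" for op in model.opset_import):
    model.opset_import.append(helper.make_opsetid("com.microsoft", 1))

onnx.save(model, "model_qlinear.onnx")
```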
Hi @hoangtv2000, as there have been no further comments, I will go ahead and close out this issue. Thank you for reaching out! If you have not already, be sure to "star" our GitHub repos.
Best, Jeannie / Neural Magic
Hi, I trained a YOLOv8 model and exported it to ONNX format with the quantization recipe below. I set weight_bits=8 and activation_bits=8 to ensure that the quantized model's full inference flow stays in fixed-point uint8 values, but the QuantizeLinear and DequantizeLinear nodes still exist and convert the activations to floating-point tensors. I checked final_recipe.yaml and saw that my activation_bits setting was disabled. Is there any way to get rid of these nodes while preserving model performance, or another recipe setting that makes the model fully integer at inference? Thanks,