quic / aimet

AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
https://quic.github.io/aimet-pages/index.html

question about the api of "QuantizationSimModel.compute_encodings" #2517

Open xiexiaozheng opened 10 months ago

xiexiaozheng commented 10 months ago

I used AIMET 1.28 based on PyTorch for Quantization-Aware Training (QAT) of my model. The quant_scheme used was 'QuantScheme.post_training_tf', which I understand computes the min/max statistics of tensors. However, I noticed that the min/max values calculated by 'compute_encodings' for each layer do not match the min/max values of the corresponding float32 model outputs; the range computed by 'compute_encodings' is smaller than the range of the float32 model. Why is this the case?
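
For context, a minimal sketch of the setup being described, assuming the AIMET 1.x PyTorch API (argument names may differ slightly between releases, and the model and calibration data here are stand-ins):

```python
import torch
from aimet_common.defs import QuantScheme
from aimet_torch.quantsim import QuantizationSimModel

# Stand-in model and calibration data for illustration only.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.Hardswish()).eval()
dummy_input = torch.randn(1, 3, 32, 32)
calib_data = [torch.randn(1, 3, 32, 32) for _ in range(8)]

sim = QuantizationSimModel(model,
                           dummy_input=dummy_input,
                           quant_scheme=QuantScheme.post_training_tf)

def forward_pass(sim_model, _):
    # compute_encodings runs this callback; the quantizers observe the
    # min/max of the tensors seen during these forward passes.
    with torch.no_grad():
        for x in calib_data:
            sim_model(x)

sim.compute_encodings(forward_pass_callback=forward_pass,
                      forward_pass_callback_args=None)
```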

quic-mangal commented 10 months ago

Hi @xiexiaozheng, you will see a perfect match between the min/max of the tensor for weight parameters, but for activations the tensor contains quantization noise from the layers above the module producing that activation tensor. So the tensor itself does not match the FP32 output, and therefore its min/max will be different.
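
As a toy illustration (not AIMET code, with fake quantization implemented by hand): the same layer produces a different output range once its input already carries quantization noise, so the range observed during calibration need not match the FP32 range.

```python
import torch

torch.manual_seed(0)
conv = torch.nn.Conv2d(3, 8, 3)
x = torch.randn(4, 3, 32, 32)

def fake_quant(t, num_bits=8):
    # Simple asymmetric min/max fake quantization: quantize then dequantize.
    t_min, t_max = t.min(), t.max()
    scale = (t_max - t_min) / (2 ** num_bits - 1)
    return torch.round((t - t_min) / scale) * scale + t_min

fp32_out = conv(x)               # activation of the FP32 model
noisy_out = conv(fake_quant(x))  # same layer, but the input is already quantized

print("FP32 range:       ", fp32_out.min().item(), fp32_out.max().item())
print("quant-noisy range:", noisy_out.min().item(), noisy_out.max().item())
```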

xiexiaozheng commented 10 months ago

Hi @quic-mangal, the output of compute_encodings is calculated under the condition that the input of the layer already contains quantization noise. Where do the initial quantization parameters that compute_encodings needs in order to calculate its result come from? Are they generated iteratively?

In addition, I have another question. I quantized a model using QAT in AIMET, and after exporting the model and quantization parameters I found a drop of about 10 points in accuracy when running on the DSP of the Snapdragon 8 Gen 2. I traced it to one layer, a conv+hardswish: the output of that layer differs between AIMET and the DSP even with the same input, and the quantization parameters for the layer are the same on both platforms. How should I find the reason for the difference in the results of this layer?

quic-mangal commented 10 months ago

> Where do the initial quantization parameters that compute_encodings needs in order to calculate its result come from?

By parameters, I mean weights and biases. Since weights and biases are known tensors, we can take the min and max for them directly; we don't need iterations or initialization.
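
For illustration, a small sketch (an assumed derivation, not the library's internal code) of taking an 8-bit min/max encoding directly from a known weight tensor:

```python
import torch

weight = torch.randn(8, 3, 3, 3)            # e.g. a conv weight tensor
w_min = min(weight.min().item(), 0.0)       # keep zero representable
w_max = max(weight.max().item(), 0.0)

scale = (w_max - w_min) / 255.0             # 2**8 - 1 quantization steps
offset = round(w_min / scale)               # zero point in the integer domain
print(f"min={w_min:.4f} max={w_max:.4f} scale={scale:.6f} offset={offset}")
```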

@quic-akinlawo, is it possible for you to comment on the second part of the issue?

xiexiaozheng commented 10 months ago

> Where do the initial quantization parameters that compute_encodings needs in order to calculate its result come from?
>
> By parameters, I mean weights and biases. Since weights and biases are known tensors, we can take the min and max for them directly; we don't need iterations or initialization.
>
> @quic-akinlawo, is it possible for you to comment on the second part of the issue?

@quic-mangal Hi, I understand that there is no need to iterate for weights and biases, but what about activations? When computing activation statistics in the presence of quantization noise, where does the initial quantization information come from?

@quic-akinlawo Hi, could you please comment on that issue? Thanks very much.

quic-akinlawo commented 10 months ago

Hello @xiexiaozheng, I would like to confirm that your comparison is one-to-one.

Can you clarify this comment: "I found that it was related to one layer, which is conv+hardswish". Are you seeing a single layer, or are there two separate layers in the converted model? What metric are you using to compare the outputs?

xiexiaozheng commented 10 months ago

Hi @quic-akinlawo, for the first question: the hardswish layer is an activation layer, and when the model runs on the DSP, the conv layer and the hardswish layer are fused and computed together.

For the second question: the similarity measure I am using is cosine similarity, which I use to compare the output of the simulated model with the output of the model running on the DSP. I take the output of the layer immediately before the target layer in the DSP model as the input to the corresponding layer in the simulated model, ensuring that the inputs to the two corresponding layers are consistent.
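
For reference, a minimal sketch of the comparison described above, with dummy tensors standing in for the captured layer outputs:

```python
import torch
import torch.nn.functional as F

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> float:
    # Flatten both layer outputs and compare them as single vectors.
    return F.cosine_similarity(a.flatten().unsqueeze(0),
                               b.flatten().unsqueeze(0)).item()

# Stand-ins for the simulated-model output and the DSP output of the same
# layer, both computed from the same input tensor.
sim_out = torch.randn(1, 16, 28, 28)
dsp_out = sim_out + 0.01 * torch.randn_like(sim_out)
print(f"cosine similarity: {cosine_sim(sim_out, dsp_out):.6f}")
```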

quic-mangal commented 10 months ago

@xiexiaozheng, in the compute_encodings function we do a forward pass in which we pass data through the model; the min/max of this data are accumulated to obtain the min/max for the activations.

xiexiaozheng commented 10 months ago

> @xiexiaozheng, in the compute_encodings function we do a forward pass in which we pass data through the model; the min/max of this data are accumulated to obtain the min/max for the activations.

@quic-mangal Hi, could you elaborate on the "accumulated" process? Are the quantization nodes enabled during this forward pass?

quic-mangal commented 10 months ago

Yes, they are enabled. By accumulated I mean that statistics are collected for each batch of data, and the min/max are updated based on these statistics.
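
In other words, something along these lines (a sketch of the idea, not AIMET's internal code):

```python
import torch

# Start from sentinel values so the first batch always updates both bounds.
running_min, running_max = float("inf"), float("-inf")

calib_batches = [torch.randn(8, 16) for _ in range(10)]   # stand-in calibration data
for batch in calib_batches:
    running_min = min(running_min, batch.min().item())
    running_max = max(running_max, batch.max().item())

print(f"accumulated min={running_min:.4f}, max={running_max:.4f}")
```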

xiexiaozheng commented 10 months ago

@quic-mangal Hi, since the quantization nodes are enabled during the forward pass, where do the initial quantization statistics inside the quantization nodes come from?

quic-mangal commented 10 months ago

This is the initial value:

```cpp
double min = std::numeric_limits<double>::max();
double max = -std::numeric_limits<double>::max();
```

xiexiaozheng commented 10 months ago

> This is the initial value:
>
> double min = std::numeric_limits<double>::max();
> double max = -std::numeric_limits<double>::max();

@quic-mangal OK, I see. Thank you for your patient explanation.