Closed shuokay closed 5 months ago
Optimizers are only used to evaluate the scale and zero-point during the quantization of weights.
@dacorvo After studying the code for Calibration, I finally understand your meaning about the role of the Optimizer and how the calibration is done. However, I still have some questions about how to calibrate the outputs of ops; I will open a new issue to describe them.
To calibrate the activations, quanto uses absmax + momentum. Other algorithms could be applied. In other frameworks, the object doing this is typically called an Observer.
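The absmax + momentum scheme mentioned above can be sketched as a small standalone class (a hypothetical re-implementation for illustration, not quanto's actual code; the class and method names are assumptions):

```python
class AbsmaxMomentumObserver:
    """Track a running absmax of activations with exponential momentum."""

    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum
        self.absmax = None

    def update(self, batch):
        # absmax of the current batch of activation values
        batch_absmax = max(abs(x) for x in batch)
        if self.absmax is None:
            self.absmax = batch_absmax
        else:
            # exponential moving average of the per-batch absmax
            self.absmax = (self.momentum * self.absmax
                           + (1.0 - self.momentum) * batch_absmax)

    def scale(self, qmax: int = 127) -> float:
        # symmetric int8 scale derived from the calibrated absmax
        return self.absmax / qmax
```

Each calibration batch updates the running absmax, and the final scale is derived from it once calibration ends.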
Just to be clear, there is no abstraction in quanto (yet) to support that. It could be added if it allowed implementing an algorithm whose results outperform absmax + momentum on some models. I personally think the key is not the calibration itself, but rather the smoothing of activations (i.e., rebalancing the activation scales into the weights of the next Linear).
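The smoothing idea (rebalancing activation scales into the weights of the next Linear) can be sketched as follows. This is a hedged, SmoothQuant-style illustration with a hypothetical `smooth` helper, not a quanto API: each activation channel is divided by a factor s, and s is folded into the corresponding weight column so the matmul result is unchanged while the activation range becomes easier to quantize.

```python
def smooth(act_absmax, weight, alpha=0.5):
    """Compute per-channel smoothing factors and fold them into the weights.

    act_absmax: per-channel absmax of activations (length = in_features).
    weight: next Linear's weight, rows = out_features, cols = in_features.
    Returns (s, scaled_weight) such that (x / s) @ scaled_weight.T == x @ weight.T.
    """
    n_in = len(act_absmax)
    # per-column absmax of the weight
    w_absmax = [max(abs(weight[r][c]) for r in range(len(weight)))
                for c in range(n_in)]
    # balance activation and weight ranges: s_c = a_c^alpha / w_c^(1-alpha)
    s = [(a ** alpha) / (w ** (1 - alpha)) for a, w in zip(act_absmax, w_absmax)]
    # fold s into the weight columns so the product is unchanged
    scaled_weight = [[weight[r][c] * s[c] for c in range(n_in)]
                     for r in range(len(weight))]
    return s, scaled_weight
```

At inference time the division by s is typically folded into the previous op (e.g. a LayerNorm), so smoothing adds no runtime cost.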
@dacorvo I agree with your point that the key is the smoothing of activations, and I have read a paper proposing a method named AWQ that smooths the activations and merges the smoothing function into the weights of the preceding/next layer. However, based on my experience quantizing convolution-based networks, what we do is find outliers and suppress them; that is why we need optimizers like MSE, Percentile, etc. These slides from NVIDIA may help: https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
When quantizing LLMs, we should protect the outliers (https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/), but when quantizing non-LLMs such as convolution networks, we need to find outliers and remove or suppress them.
The Percentile optimizer is a commonly used calibration method for activations. We would first need to make QModuleMixin support an optimizer for activations.
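To make the outlier-suppression point concrete, here is a minimal sketch of percentile calibration (a hypothetical helper, not an existing quanto Optimizer): instead of taking the absolute maximum, which a single outlier can inflate, the range is clipped at e.g. the 99.9th percentile of |x|.

```python
def percentile_scale(values, percentile=99.9, qmax=127):
    """Derive a symmetric quantization scale from a percentile of |x|.

    Clipping at a percentile below 100 suppresses outliers, trading a bit of
    clipping error for a much finer resolution on the bulk of the values.
    """
    mags = sorted(abs(v) for v in values)
    # index of the chosen percentile in the sorted magnitudes
    k = min(len(mags) - 1, int(len(mags) * percentile / 100.0))
    clip = mags[k]
    return clip / qmax
```

With 99 values of magnitude 1.0 and one outlier of 100.0, a 98th-percentile clip yields a scale of 1/127, whereas absmax (percentile=100) would yield 100/127 and waste nearly all of the int8 range on the outlier.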