huggingface / optimum-quanto

A pytorch quantization backend for optimum

Add Percentile Optimizer #143

Closed shuokay closed 5 months ago

shuokay commented 5 months ago

The Percentile Optimizer is a commonly used calibration method for activations. We first need to make QModuleMixin support an optimizer for activations.
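For context, here is a minimal sketch of the idea, assuming per-tensor symmetric int8 quantization: the scale is derived from a high percentile of the absolute values instead of the absolute maximum, so rare outliers get clipped. The function name and defaults are illustrative, not quanto's API.

```python
import torch

def percentile_scale(x: torch.Tensor, percentile: float = 99.99, bits: int = 8) -> torch.Tensor:
    """Compute a symmetric quantization scale from a high percentile of |x|
    instead of the absolute maximum, so that rare outliers are clipped."""
    qmax = 2 ** (bits - 1) - 1
    threshold = torch.quantile(x.abs().flatten().float(), percentile / 100.0)
    return threshold / qmax

# Hypothetical usage on a batch of activations:
activations = torch.randn(16, 1024)
scale = percentile_scale(activations)
q = torch.clamp(torch.round(activations / scale), -128, 127).to(torch.int8)
```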

dacorvo commented 5 months ago

Optimizers are only used to evaluate the scale and zero-point during the quantization of weights.

shuokay commented 5 months ago

@dacorvo After studying the Calibration code, I finally understand what you mean about the role of the Optimizer and how calibration is done. However, I still have some other questions about how to calibrate the outputs of ops; I will open a new issue to describe them.

dacorvo commented 5 months ago

To calibrate the activations, quanto uses absmax + momentum. Other algorithms could be applied. In other frameworks, the object doing this is typically called an Observer.
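Roughly, the idea is an exponential moving average of the per-batch absolute maximum. A minimal sketch of such an observer is below; this is not quanto's actual implementation, just an illustration of absmax + momentum.

```python
import torch

class AbsmaxObserver:
    """Track a running estimate of the activation range with momentum,
    then derive a symmetric int8 scale from it."""

    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum
        self.absmax = None

    def update(self, x: torch.Tensor) -> None:
        batch_absmax = x.detach().abs().max()
        if self.absmax is None:
            self.absmax = batch_absmax
        else:
            # Exponential moving average of the per-batch maxima
            self.absmax = self.momentum * self.absmax + (1 - self.momentum) * batch_absmax

    def scale(self, bits: int = 8) -> torch.Tensor:
        return self.absmax / (2 ** (bits - 1) - 1)
```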

dacorvo commented 5 months ago

Just to be clear, there is no abstraction in quanto (yet) to support that. It could be added if it allowed implementing an algorithm whose results outperform absmax + momentum on some models. I personally think that the key is not the calibration itself, but rather the smoothing of activations (i.e. rebalancing the activation scales into the weights of the next Linear).
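To illustrate that last point, a rough sketch of the rebalancing idea (in the spirit of SmoothQuant): divide the activations by a per-channel factor and fold the same factor into the next Linear's weights, so the product is unchanged while outliers move from activations to weights. The factor formula, `alpha`, and function name are illustrative assumptions, not an existing API.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def smooth_linear(act_absmax: torch.Tensor, linear: nn.Linear, alpha: float = 0.5) -> torch.Tensor:
    """Rebalance per-channel activation scales into the weights of the next Linear.

    act_absmax: per-input-channel max |activation| collected during calibration,
                shape (in_features,).
    Returns the per-channel factor the activations must be divided by upstream.
    """
    weight_absmax = linear.weight.abs().max(dim=0).values  # per input channel
    # Migrate quantization difficulty from activations to weights
    factor = (act_absmax.pow(alpha) / weight_absmax.pow(1 - alpha)).clamp(min=1e-5)
    linear.weight.mul_(factor)  # W' = W * diag(s), applied per input channel
    return factor  # activations become X' = X / s, so X' @ W'.T == X @ W.T
```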

shuokay commented 5 months ago

> Just to be clear, there is no abstraction in quanto (yet) to support that. It could be added if it allowed implementing an algorithm whose results outperform absmax + momentum on some models. I personally think that the key is not the calibration itself, but rather the smoothing of activations (i.e. rebalancing the activation scales into the weights of the next Linear).

@dacorvo I agree with your point that the key is the smoothing of activations, and I have read a paper that proposes a method named AWQ, which tries to smooth the activations and merge the smoothing function into the weights of the previous/next layer. However, based on my experience, when quantizing convolution-based networks, what we are doing is finding outliers and suppressing them; that's why we need optimizers like MSE, Percentile, etc. Here are the slides from NVIDIA that may help: https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf
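As an illustration of the outlier-suppression approach, a minimal sketch of an MSE-style optimizer: search over clipping thresholds below the absolute max and keep the one that minimizes the quantization error. The grid size and parameters are arbitrary, just to show the idea.

```python
import torch

def mse_clipping_threshold(x: torch.Tensor, bits: int = 8, steps: int = 50) -> torch.Tensor:
    """Search clipping thresholds below the absolute max and keep the one
    that minimizes the mean squared quantize/dequantize error."""
    qmax = 2 ** (bits - 1) - 1
    absmax = x.abs().max()
    best_thr, best_err = absmax, float("inf")
    for i in range(1, steps + 1):
        thr = absmax * i / steps
        scale = thr / qmax
        # Quantize then dequantize with this candidate threshold
        xq = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
        err = torch.mean((x - xq) ** 2).item()
        if err < best_err:
            best_thr, best_err = thr, err
    return best_thr
```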

shuokay commented 5 months ago

When quantizing LLMs, we should protect the outliers (https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/), but when quantizing non-LLMs such as convolutional networks, we need to find outliers and remove/suppress them.