huggingface / optimum-quanto

A pytorch quantization backend for optimum
Apache License 2.0

How does quanto calibrate torch functions? #152

Closed shuokay closed 4 months ago

shuokay commented 5 months ago

I have learned that quanto calibrates ops in module form by adding module hooks, but what about torch functions like torch.sigmoid, torch.elu, and torch.log? I think the output scale of torch.sigmoid could be evaluated directly, similarly to quanto's approach with softmax. Additionally, torch.elu could be substituted with torch.nn.ELU. However, I'm uncertain how functions like torch.log, which are unbounded and lack explicit module forms, will be calibrated within quanto.
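
For context, here is a minimal sketch of the hook-based calibration I mean (not quanto's actual implementation; the `CalibrationObserver` class and the max-abs scale rule are just illustrative assumptions). A forward hook records each module's output range and derives an int8 scale from it, but a plain call like `torch.log(x)` inside `forward` is never seen by such hooks, which is exactly my question:

```python
import torch

class CalibrationObserver:
    """Toy observer: records the max absolute value seen at a module's output
    and derives a symmetric int8 scale from it. Illustrative only."""

    def __init__(self):
        self.amax = 0.0

    def hook(self, module, inputs, output):
        # Forward hook: called after the module runs, so we can inspect its output.
        self.amax = max(self.amax, output.detach().abs().max().item())

    @property
    def scale(self):
        # Symmetric int8: map [-amax, amax] onto [-127, 127].
        return self.amax / 127.0 if self.amax > 0 else 1.0


model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Sigmoid())
observers = {}
for name, module in model.named_modules():
    if isinstance(module, (torch.nn.Linear, torch.nn.Sigmoid)):
        obs = CalibrationObserver()
        module.register_forward_hook(obs.hook)
        observers[name] = obs

with torch.no_grad():
    for _ in range(8):  # calibration batches
        model(torch.randn(4, 16))

print({name: obs.scale for name, obs in observers.items()})
```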

dacorvo commented 5 months ago

It doesn't: only the inputs/outputs of modules can be calibrated. I had a branch where I managed to calibrate around functions using a DispatchMode, but it was a bit overkill, since most of the time the next function does not accept quantized inputs. Instead, I used it to deactivate the quantization of module outputs when they are not consumed by a compatible operation (streamline mode).
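
Roughly, the idea is something like the following (an illustrative sketch of the general mechanism, not the code from that branch): a `TorchDispatchMode` sees every ATen call, including purely functional ones such as `torch.sigmoid` or `torch.log`, so it can record their output ranges during calibration.

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class RangeTrackingMode(TorchDispatchMode):
    """Intercepts every ATen op (including functional calls such as
    torch.sigmoid or torch.log) and records the max output magnitude per op.
    Simplified: only handles ops returning a single floating-point tensor."""

    def __init__(self):
        super().__init__()
        self.amax = {}

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        out = func(*args, **(kwargs or {}))
        if isinstance(out, torch.Tensor) and out.is_floating_point():
            name = str(func)
            self.amax[name] = max(self.amax.get(name, 0.0),
                                  out.abs().max().item())
        return out

x = torch.randn(8, 8)
with RangeTrackingMode() as mode:
    y = torch.log(torch.sigmoid(x) + 1e-6)
print(mode.amax)
```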

shuokay commented 5 months ago

@dacorvo Can you share the branch? Is it the tracking_mode branch here: https://github.com/huggingface/quanto/tree/tracking_mode?

dacorvo commented 5 months ago

I can't find it: it was an earlier version of this, going in a different direction.

dacorvo commented 5 months ago

And just to be clear, this is not something I plan to change in the near future: advanced dispatch is only required for activations, and for now the quantization of activations is not really required.

IMHO, only float8 activations are interesting, as they will be compatible with many more operations. But this will only be usable once we can take advantage of accelerated float8 operations.

shuokay commented 5 months ago

Hi @dacorvo,

I agree that fp8 is more promising than int8. However, fp8 is not generally accessible at the moment. If I remember correctly, only Hopper and newer GPUs (H100, etc.) support fp8 natively.

What's more, consider NPUs (an NPU is a type of ASIC similar to NVDLA (http://nvdla.org/), such as Qualcomm's SA8650). To the best of my knowledge, most of the compute power of NPUs is in int8, with only a very small amount of float compute (so small that we always avoid using it when building models). Additionally, for neural networks, and especially for Transformers, the performance bottleneck is usually memory bandwidth, so we should minimize data transfer as much as possible (i.e., avoid falling back to float).

In summary:

  1. int8 remains the most important data type until fp8 is widely supported on NPUs.
  2. When building models, we should try to only use NPU-supported int8 operators and avoid falling back to float.
  3. We should reduce data transfer due to the bandwidth bottleneck.

So I think it is necessary to quantize activations.

dacorvo commented 4 months ago

quanto is a pytorch quantization toolkit, producing pytorch quantized models to be deployed on hosts running pytorch. Moreover, it is designed to immediately dequantize tensors whenever it reaches an operation that is incompatible with the underlying storage data type. For integer tensors, this means very few operations indeed (see under tensor/ops).

If you want to target NPUs, you can use quanto to obtain quantized weights, but you will never be able to replace, in the pytorch graph, the operations that are not compatible with integer arithmetic. There are other tools for that, one of them being quantizeml, one of my previous projects (closed-source, unfortunately).
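
For reference, a weight-only flow with quanto looks roughly like the snippet below (a sketch assuming the optimum.quanto package layout and the API names from the README; double-check against the current documentation):

```python
import torch
from optimum.quanto import quantize, freeze, qint8

model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)

# Weight-only int8 quantization: activations stay in float, so every torch
# function in the graph keeps working, at the cost of dequantizing weights
# on the fly for incompatible operations.
quantize(model, weights=qint8)
freeze(model)

with torch.no_grad():
    out = model(torch.randn(1, 64))
print(out.shape)
```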