jerryzh168 opened 3 months ago
If this is something that we want to add, I can take a stab at integrating it.
Thanks @vayuda! I feel it might be useful to integrate at the quant primitive ops / qmodule level at least, similar to https://github.com/pytorch/ao/issues/533#issuecomment-2243950283, while keeping their special quantization logic (our current `quantize` API does not really generalize to AWQ and seems nontrivial to extend to accommodate it). But in this case their qmodule also looks a bit complicated, so I'm not sure if we can reuse our existing quant primitive ops. Ideally, we could structure their stuff as preprocessing/postprocessing + reuse our existing quant_primitive ops (maybe with some extensions), and then implement a new layout for their special packing format.
We are still discussing how we can support it, but please feel free to let us know your thoughts on this as well.
Is there already an up-to-date example of a quantization workflow which uses calibration data to tune certain parameters (the scale factor in the case of AWQ)?
I think you could start with https://github.com/pytorch/ao/blob/main/tutorials/calibration_flow/static_quant.py
Also, I was talking to @HDCharles about this, and it seems sufficient to implement AWQ just for linear and ignore the complicated case that I linked in the AWQ code for now.
Note: We may be able to fuse `equalization_scale` into the kernel as well, but our current A16W4 kernel is implemented in tinygemm, so we'd need to modify the tinygemm kernels. If we are relying on torch.compile, it would be easy to do.
BTW, in quite a few popular LLMs, such an `equalization_scale` can be folded into the previous computation, like what AutoAWQ is doing, e.g. folded into the previous normalization layer or into the weights of the previous linear. This can probably be done in a frontend transformation before torch.compile, though.
Yeah, I talked about this in the "Turn Input-Weight Equalization to Cross Layer Equalization" section.
We also have some evidence that torch.compile can just fuse it, from the smoothquant code: https://github.com/pytorch/ao/blob/afde1755d906ad644e04835675e7856d72c3c87b/torchao/quantization/smoothquant.py#L150-L152
AWQ seems popular: 3000 appearances in Hugging Face models (https://huggingface.co/models?sort=trending&search=AWQ), similar to GPTQ. Maybe we can add this to torchao as well.
Overview
At a high level, AWQ scales the weights based on some power of the average per-channel magnitude of the activations (s_X^alpha), as mentioned in the paper, where s_X is the average per-channel magnitude of the activation.
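A minimal sketch of the idea, with hypothetical helper names (not the torchao API), assuming `act_stats` holds the calibrated average per-channel activation magnitude:

```python
import torch

def awq_equalization_scale(act_stats: torch.Tensor, alpha: float) -> torch.Tensor:
    # s_X^alpha, clamped to avoid division by zero on dead channels
    return act_stats.clamp(min=1e-5).pow(alpha)

def equalize(weight: torch.Tensor, x: torch.Tensor, scale: torch.Tensor):
    # weight: (out_features, in_features), x: (..., in_features), scale: (in_features,)
    # Scale the weight up along the input-channel dimension (this is what gets
    # quantized) and scale the activation down by the same factor, so in fp:
    # F.linear(x / scale, weight * scale) == F.linear(x, weight)
    return x / scale, weight * scale
```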
Implementation in the original AWQ repo
The main things are finding the scale and applying the scale to the weights.
Note: in the original AWQ implementation, the logic for finding the scale is a bit complicated, but that's mainly to deal with the separate q/k/v modules. We could start by just implementing AWQ for simple linears and worry about the more complicated model structures later.
For applying the scales, in the original implementation we have to manually specify what the `prev_module` is. We could do the same, or we could symbolically trace the model (to preserve all call_module nodes) in order to figure out the relationship between different modules programmatically.

How to implement it in torchao
First, I think we can focus on implementing AWQ for the linear module only. We can get the activation stats using observers, search for the alpha parameter based on the output of the quantized linear module, and reuse the existing quant_primitives for affine quantization in torchao.
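For example, the alpha search could be a simple grid search that minimizes the output error of the (fake-)quantized linear on calibration data. This is just a sketch, with `fake_quantize` standing in for whatever affine quant primitive we end up reusing:

```python
import torch
import torch.nn.functional as F

def search_alpha(weight, calib_x, act_stats, fake_quantize, grid=20):
    """Pick alpha in [0, 1] that minimizes the quantized linear's output error.

    fake_quantize: callable that quantizes + dequantizes a weight tensor
    (assumed to be built from torchao's affine quant primitives).
    """
    float_out = F.linear(calib_x, weight)
    best_alpha, best_err = 0.0, float("inf")
    for i in range(grid + 1):
        alpha = i / grid
        scale = act_stats.clamp(min=1e-5).pow(alpha)
        qweight = fake_quantize(weight * scale)
        # the equalization scale is folded into the activation side
        quant_out = F.linear(calib_x / scale, qweight)
        err = (float_out - quant_out).pow(2).mean().item()
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha
```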
Step 1. Collecting Observer Stats
In terms of collecting activation stats, we could follow what we did in https://github.com/pytorch/ao/blob/afde1755d906ad644e04835675e7856d72c3c87b/tutorials/calibration_flow/static_quant.py#L19-L35: implement a similar `ObservedLinear` with an observer (or just a logger) to log the activation(s), and create a function `insert_awq_observers_` similar to https://github.com/pytorch/ao/blob/afde1755d906ad644e04835675e7856d72c3c87b/tutorials/calibration_flow/static_quant.py#L37
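A rough sketch of what this could look like (the class and function names below are placeholders, not existing torchao APIs):

```python
import torch
import torch.nn.functional as F

class AWQObservedLinear(torch.nn.Linear):
    """Linear that records the average per-channel magnitude of its inputs."""
    def __init__(self, in_features, out_features, bias=True):
        super().__init__(in_features, out_features, bias)
        self.register_buffer("act_sum", torch.zeros(in_features))
        self.register_buffer("act_count", torch.zeros(1))

    def forward(self, x):
        with torch.no_grad():
            flat = x.reshape(-1, x.shape[-1])
            self.act_sum += flat.abs().sum(dim=0)
            self.act_count += flat.shape[0]
        return F.linear(x, self.weight, self.bias)

    def average_magnitude(self):
        return self.act_sum / self.act_count.clamp(min=1)

def insert_awq_observers_(model: torch.nn.Module):
    """Replace every nn.Linear in the model with the observed version, in place."""
    for name, child in model.named_children():
        if isinstance(child, torch.nn.Linear):
            observed = AWQObservedLinear(
                child.in_features, child.out_features, child.bias is not None
            )
            observed.weight = child.weight
            observed.bias = child.bias
            setattr(model, name, observed)
        else:
            insert_awq_observers_(child)
```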
Step 2. Integrate with `AffineQuantizedTensor`
Calculating per channel scale can happen when we apply quantization to the weights, similar to: https://github.com/pytorch/ao/blob/afde1755d906ad644e04835675e7856d72c3c87b/tutorials/calibration_flow/static_quant.py#L49-L63
As discussed with @vayuda in CUDA_MODE, I think we could implement a new `LayoutType` and `AQTLayout` that will scale the weight with `equalization_scale` before quantization, and can apply the `equalization_scale` tensor to the input activation tensor in the linear operator. (Note: I think we should call this `equalization_scale` because it's not AWQ only; smoothquant can reuse this.)

In terms of API, we can implement some helper function like https://github.com/pytorch/ao/blob/afde1755d906ad644e04835675e7856d72c3c87b/torchao/quantization/quant_api.py#L363 to support any configurations.
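I'm not pinning down the exact `LayoutType`/`AQTLayout` extension points here, but the math the new layout's linear dispatch would need to implement is roughly the following (a sketch under that assumption, not the actual AffineQuantizedTensor integration):

```python
import torch
import torch.nn.functional as F

def awq_layout_linear(x, dequantized_weight, equalization_scale, bias=None):
    """What the linear override for the new layout would compute.

    dequantized_weight: the dequantized weight; it was quantized *after* being
        multiplied by equalization_scale, shape (out_features, in_features).
    equalization_scale: per-input-channel scale, shape (in_features,).
    """
    # Dividing the activation undoes the scale folded into the weight, so the
    # result matches the original fp linear up to quantization error.
    return F.linear(x / equalization_scale, dequantized_weight, bias)
```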
Note: we may be able to fuse `equalization_scale` into the kernel as well, but our current A16W4 kernel is implemented in tinygemm, so we'd need to modify the tinygemm kernels. If we are relying on torch.compile, it would be easy to do.

Additional Optimizations
Turn Input-Weight Equalization to Cross Layer Equalization
As we can see from the original implementation, when applying the scale to linear weights we apply the scale both to the current linear weight and to the weight of the previous module. This is only applicable if the previous operation (and anything between it and the current linear, e.g. the activation function) is equivariant to per-channel positive scaling, i.e. f(s * x) = s * f(x) for s > 0; see Section 4.1 of https://arxiv.org/pdf/1906.04721 for more details.
But this could be true for many use cases. To safely apply this optimization, we could symbolically trace the model with torch.fx to find the module feeding each linear and only fold the scale when that producer (and the ops in between) is known to satisfy the condition above; see https://pytorch.org/docs/stable/fx.html for docs related to torch.fx. A rough sketch is below.
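A minimal sketch of that, assuming we only handle producers that are themselves modules and only fold into a previous LayerNorm or Linear (roughly what AutoAWQ does); these helpers are hypothetical:

```python
import torch

def find_prev_module(model: torch.nn.Module, linear_name: str):
    """Use torch.fx to find the module whose output feeds the given linear."""
    gm = torch.fx.symbolic_trace(model)  # preserves call_module nodes
    modules = dict(gm.named_modules())
    for node in gm.graph.nodes:
        if node.op == "call_module" and node.target == linear_name:
            producer = node.args[0]
            if isinstance(producer, torch.fx.Node) and producer.op == "call_module":
                return modules[producer.target]
    return None  # producer is a function call / graph input: skip the folding

def fold_scale_into_prev(prev_module, equalization_scale):
    """Fold the x -> x / equalization_scale step into the previous module."""
    with torch.no_grad():
        if isinstance(prev_module, torch.nn.LayerNorm):
            prev_module.weight.div_(equalization_scale)
            if prev_module.bias is not None:
                prev_module.bias.div_(equalization_scale)
        elif isinstance(prev_module, torch.nn.Linear):
            # the scale is per output channel of the previous linear
            prev_module.weight.div_(equalization_scale[:, None])
            if prev_module.bias is not None:
                prev_module.bias.div_(equalization_scale)
        else:
            raise NotImplementedError(f"cannot fold into {type(prev_module)}")
```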
Logistics (Code Location, Test and Benchmarks)
Please create an `awq` folder under https://github.com/pytorch/ao/tree/main/torchao/prototype. The flow and layout implementation can be in separate files, e.g. flow.py, layout.py (there might be some missing extension points of AffineQuantizedTensor, but we'll work on these at the same time).

For testing, please create a `test_awq.py` in https://github.com/pytorch/ao/tree/main/test/prototype; we can test the basic `insert_awq_observers_` flow and also the layout creation etc.

For an e2e flow demo, please add an `awq.py` in https://github.com/pytorch/ao/tree/main/tutorials/calibration_flow following the static quant example, and please show the benchmarking results as well (since we are using an optimized kernel), following https://github.com/pytorch/ao/tree/main/torchao/quantization#quantization-flow-example

The last step is to test this with llama2/llama3 following the instructions in https://github.com/pytorch/ao/tree/main/torchao/_models/llama and measure the metrics in https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks if you have GPU machines.
References