mit-han-lab / smoothquant

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
https://arxiv.org/abs/2211.10438
MIT License

Support auto search for per-layer smoothing alphas and auto clip for weights, both bits-aware; can do W4A8 with minor loss #65

Closed · yyfcc17 closed this 7 months ago

yyfcc17 commented 7 months ago
tonylins commented 7 months ago

Thanks for the contribution! Could you please provide some performance comparison w/ and w/o auto search?

yyfcc17 commented 7 months ago

The main improvement is that we no longer need to set alpha multiple times, measure the results, and then pick the best value.

That manual set-and-test loop takes a lot of time once the model becomes large.

Also, a per-layer alpha seems to be a more reasonable solution (a rough sketch of the idea is below).
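For context, here is a minimal sketch of what such a per-layer alpha search could look like. This is not the code in this PR; the function names (`search_layer_alpha`, `fake_quant`) and the simple per-tensor fake quantization are my own illustrative assumptions. The idea is just to sweep a grid of alpha values per linear layer and keep the one that minimizes the quantized output error at the target bit-widths.

```python
import torch


def fake_quant(t, n_bits):
    """Symmetric per-tensor fake quantization (illustrative only)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = t.abs().amax().clamp(min=1e-5) / qmax
    return (t / scale).round().clamp(-qmax - 1, qmax) * scale


def search_layer_alpha(linear, act_scales, calib_x, grid=20, n_bits_w=8, n_bits_a=8):
    """Hypothetical per-layer alpha search (sketch, not the PR's actual code).

    linear:     an nn.Linear layer to be smoothed and quantized
    act_scales: per-input-channel activation max |X_j| from calibration
    calib_x:    a small batch of calibration inputs for this layer
    """
    w = linear.weight                        # [out_features, in_features]
    fp_out = calib_x @ w.t()                 # full-precision reference output
    w_max = w.abs().amax(dim=0).clamp(min=1e-5)

    best_alpha, best_err = 0.5, float("inf")
    for i in range(grid + 1):
        alpha = i / grid
        # SmoothQuant migration: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
        s = (act_scales.pow(alpha) / w_max.pow(1 - alpha)).clamp(min=1e-5)
        x_s = calib_x / s                    # smoothed activations
        w_s = w * s                          # smoothed weights

        # fake-quantize both sides at the target bit-widths and compare outputs
        q_out = fake_quant(x_s, n_bits_a) @ fake_quant(w_s, n_bits_w).t()
        err = (q_out - fp_out).float().pow(2).mean().item()
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha
```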

On the pileval dataset (1000 test samples), accuracy measured with last-word prediction:

chatglm2-6b W8A8

chatglm2-66b W8A8

The accuracy improvement may be minor, and I haven't tested other tasks and models (I don't have time to do that right now); the main improvement is auto search instead of manual tuning.

Maybe you can test on your own tasks and models, then decide whether this PR can be merged or not. I think it can at least be an optional feature for users to choose.


Update:

chatglm2-6b W4A8

It seems that auto search for the per-layer alpha and the clip value is important under lower-bit settings (a rough sketch of the clip search is below).
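For illustration, here is a hedged sketch of a weight auto-clip search in the same spirit; it is not the PR's implementation, and `search_weight_clip` plus the symmetric per-channel quantization are my own assumptions. For each layer, try a few clipping ratios of the per-channel weight range and keep the one that minimizes the output error at the target weight bit-width; this matters much more at 4 bits than at 8.

```python
import torch


def search_weight_clip(w, calib_x, n_bits=4, ratios=(1.0, 0.9, 0.8, 0.7, 0.6)):
    """Hypothetical per-layer weight clipping search (sketch only).

    w:       weight tensor [out_features, in_features] (already smoothed)
    calib_x: calibration inputs [tokens, in_features]
    """
    fp_out = calib_x @ w.t()                 # full-precision reference output
    qmax = 2 ** (n_bits - 1) - 1

    best_ratio, best_err = 1.0, float("inf")
    for r in ratios:
        # clip each output channel's range to a fraction of its absolute max
        w_max = w.abs().amax(dim=1, keepdim=True) * r
        w_clip = w.clamp(-w_max, w_max)
        scale = w_max.clamp(min=1e-5) / qmax
        w_q = (w_clip / scale).round().clamp(-qmax - 1, qmax) * scale

        err = (calib_x @ w_q.t() - fp_out).float().pow(2).mean().item()
        if err < best_err:
            best_ratio, best_err = r, err
    return best_ratio
```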

I feel W4A8 without loss is within reach. AWQ is a powerful method! 👍