support auto search for per-layer smoothing alphas, and auto clip for weights, both bits-aware, can do W4A8 with minor loss

yyfcc17 commented 7 months ago

add support for auto searching for per-layer smoothing alphas
add support for auto clip for layer weights
both bits-aware when auto searching
tested on chatglm2 6b, 66b, constantly generate the same (or better) result than set manually
takes 2~3 minutes to search for 66b, not very slow, cause we only smooth some linear in the block
these modification works well under W4A8, now i can do W4A8 with chatglm2 with minor loss

tonylins commented 7 months ago

Thanks for the contribution! Could you please provide some performance comparison w/ and w/o auto search?

yyfcc17 commented 7 months ago

the main improvement is that we don't need to set alpha multiple times, measure the results, then choose the best alpha

the manul setting-testing loop take a lot of time when model becomes large

and also, per-layer alpha seems to be a more resonable solution

on pileval dataset, tested 1000 samples, accuracy measured with last word prediction:

settings: W8 per-channel, A8 per-token, simulate with fake quant

chatglm2-6b W8A8

fp16: 0.616
manul: 0.617 // test alpha [0.4, 0.5, 0.6, 0.7], hit 0.6
auto: 0.62 // hit one shot

chatglm2-66b W8A8

fp16: 0.666
manul: 0.66 // test alpha [0.3, 0.4, 0.5, 0.6, 0.7], hit 0.5
auto: 0.664 // hit one shot

the accuracy improvement may be minor, and not tested on other tasks and models (don't have to time to do this right now), the main improvement is auto search instead of manually set.

maybe you can test on your own tasks and models, then decide whether this pr can be merged or not. i think it can be a feature at least, for the user to choose.

update:

chatglm2-6b W4A8

base fp16: 0.616
manul smooth: 0.589 // hit alpha = 0.5
manul smooth + auto clip: 0.594 // alpha = 0.5
auto smooth: 0.595 // hit one shot
auto smooth + auto clip: 0.608 // hit one shot

it seems auto search for per-layer alpha and clip value is important under lower bits setting.

i feel W4A8 without loss is within reach. AWQ is a powerful method! 👍

mit-han-lab / smoothquant

support auto search for per-layer smoothing alphas, and auto clip for weights, both bits-aware, can do W4A8 with minor loss #65