mobiusml / hqq

Official implementation of Half-Quadratic Quantization (HQQ)
https://mobiusml.github.io/hqq_blog/
Apache License 2.0

Question about the Weight Error #10

Closed rainyBJ closed 7 months ago

rainyBJ commented 9 months ago

Great work! I have a question about the current_error here: https://github1s.com/mobiusml/hqq/blob/master/hqq/core/optimize.py#L30. Should we use p=0.7 instead of p=1 for the l_p-norm error? Half-Quadratic Quantization is based on solving the non-convex problem where p < 1.

mobicham commented 9 months ago

Hi @rainyBJ, thanks for your question!

current_error is only used for early stopping; it's not used in any way to optimize the zero-point. Technically we optimize for ||x||_p^p, so one could use that for the early-stopping check as well. I just tested it, and it's basically the same thing as long as p is not too small (p >= 0.80): https://github.com/mobiusml/hqq/blob/master/hqq/core/optimize.py#L118

The problem is numerical instability, especially with fp16: if you push p below 0.80, the difference between successive errors becomes noisy and the optimization breaks before converging. It's better to keep it as is, since it was tested extensively on multiple models and quantization settings.
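For illustration, here is a minimal sketch of such an ||x||_p^p error one could track for early stopping (not HQQ's actual code: lp_error, W, and W_q are made-up names, and W_q is just a synthetic stand-in for a dequantized weight):

```python
# Illustrative only, not HQQ's optimize.py: lp_error, W, and W_q are made-up names,
# and W_q is a synthetic stand-in for the quantize -> dequantize output.
import torch

def lp_error(W: torch.Tensor, W_q: torch.Tensor, p: float = 1.0) -> float:
    """Mean element-wise |W - W_q|^p, i.e. the ||x||_p^p objective averaged over entries."""
    return float(torch.abs(W - W_q).pow(p).mean())

W = torch.randn(1024, 1024)
W_q = W + 0.01 * torch.randn_like(W)   # pretend this came from fake quantization

err_p1  = lp_error(W, W_q, p=1.0)      # what the early-stopping check tracks
err_p08 = lp_error(W, W_q, p=0.8)      # closer to the non-convex l_p (p < 1) objective
print(err_p1, err_p08)                 # with fp16 and p < 0.8, the gap between successive
                                       # iterations is where the noise shows up
```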

Hope it helps!

rainyBJ commented 9 months ago

Yeah, I got it, thanks for your reply!

I have another question about the rounding operation applied to the zero-point (zp) at https://github1s.com/mobiusml/hqq/blob/master/hqq/core/quantize.py#L60. This rounding comes from AWQ and is there so that 0 maps back to 0 after fake quantization (quantization + dequantization), according to this issue https://github.com/mit-han-lab/llm-awq/issues/116, which further links to https://github.com/google/gemmlowp/blob/master/doc/quantization.md.

Since we later adjust the zero-point to reduce the weight quantization error, zp ends up being a non-integer value, which makes rounding zp before optimize_weights_proximal meaningless to me. I also see this parameter is set to False by default, so have you tested how it affects the final performance of the quantized model?
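For reference, a minimal sketch of the gemmlowp argument (illustrative only, not HQQ's quantize.py; fake_quant is a made-up helper): with an integer zero-point a real 0 survives quantization + dequantization exactly, while a non-integer zero-point shifts it.

```python
# Illustrative only, not HQQ's quantize.py: fake_quant is a made-up helper.
import torch

def fake_quant(W, scale, zero, n_bits=4):
    # asymmetric quantize -> dequantize
    q = torch.clamp(torch.round(W / scale + zero), 0, 2**n_bits - 1)
    return (q - zero) * scale

W = torch.zeros(4)
scale = torch.tensor(0.1)

# Integer zero-point: round(0/0.1 + 5) - 5 = 0, so a real 0 dequantizes back to exactly 0.
print(fake_quant(W, scale, zero=torch.tensor(5.0)))   # tensor of zeros

# Non-integer zero-point: round(0/0.1 + 5.3) - 5.3 = -0.3, so a real 0 becomes ~ -0.03.
print(fake_quant(W, scale, zero=torch.tensor(5.3)))
```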

mobicham commented 9 months ago

That's correct: after optimization the zero-point can take non-integer and even negative values, especially for lower bits, so rounding it would negatively impact performance. The rounding option is only there because both GPTQ and AWQ have it in their code: https://github.com/PanQiWei/AutoGPTQ/blob/main/auto_gptq/quantization/quantizer.py#L82

By default it's disabled for lower bits, but since GPTQ/AWQ use it for 4-bit, it's enabled for 4-bit in HQQ as well. I haven't played with it much, but it certainly should not be applied after the optimization step.
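As a rough illustration of that last point, here is a sketch where a brute-force grid search stands in for the proximal zero-point optimization (none of these names come from the library): rounding the tuned zero-point afterwards can only leave the dequantization error unchanged or make it worse.

```python
# Illustrative sketch: a grid search stands in for the proximal zero-point optimization;
# the point is only that rounding the tuned zero-point afterwards cannot help.
import torch

def dequant_error(W, scale, zero, n_bits=3):
    q = torch.clamp(torch.round(W / scale + zero), 0, 2**n_bits - 1)
    return torch.abs(W - (q - zero) * scale).mean().item()

torch.manual_seed(0)
W = torch.randn(4096) * 0.02 + 0.03              # skewed synthetic weights
scale = (W.max() - W.min()) / (2**3 - 1)

# "optimize" the zero-point over a fine grid that also covers the integers
zeros = torch.arange(-2.0, 6.0, 0.01)
z_opt = min(zeros, key=lambda z: dequant_error(W, scale, z))

print(dequant_error(W, scale, z_opt))              # tuned (non-integer) zero-point
print(dequant_error(W, scale, torch.round(z_opt))) # after rounding it post-optimization
```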