spcl / QuaRot

Code for the NeurIPS 2024 paper QuaRot: end-to-end 4-bit inference of large language models.
https://arxiv.org/abs/2404.00456
Apache License 2.0

Questions about reproduction of weight-only quantization. #3

Closed: ChenMnZ closed this issue 7 months ago

ChenMnZ commented 7 months ago

Dear Authors,

Thanks for your outstanding work. I like it and have learned a lot from it!

I tried to reproduce the weight-only quantization results in Table 5, but I obtained some results that are inconsistent with your paper.

For example,

I want to know if I am missing some details. Thank you.

ChenMnZ commented 7 months ago

Adding --w_clip solved my problem.

Amazing work and excellent performance!

sashkboos commented 7 months ago

Thanks @ChenMnZ for using our code. Glad to hear that you fixed the issue. I will close this issue.

ChenMnZ commented 7 months ago

Hi,

As I mentioned before, for A16W4 quantization without w_clip, plain RTN obtains 6.11 WikiText2 perplexity, while rotate + RTN obtains a worse result of 6.99 WikiText2 perplexity.

In my understanding, after rotation (incoherence processing), the distribution of the weights should be more uniform and the weights should become easier to quantize. So what is the potential reason that RTN + rotate achieves worse results than plain RTN?
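For reference, here is how I think about it on synthetic weights. This is only a toy numpy sketch, not the QuaRot code, and the numbers are illustrative: with max-based per-channel RTN, rotating a weight matrix that has a few outlier input channels usually reduces the quantization error, which is why the degradation I measure surprises me.

```python
import numpy as np

def rtn_quantize(W, bits=4):
    """Symmetric per-output-channel round-to-nearest with a max-based scale (no clipping)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def hadamard(n):
    """Normalized Hadamard matrix via Sylvester's construction (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
d = 256
W = rng.normal(0, 0.02, size=(d, d))
W[:, :4] += rng.normal(0, 0.5, size=(d, 4))   # a few outlier input channels

W_rot = W @ hadamard(d)                       # rotate along the input dimension

for name, M in [("plain", W), ("rotated", W_rot)]:
    err = np.linalg.norm(M - rtn_quantize(M)) / np.linalg.norm(M)
    print(f"{name} relative RTN error: {err:.4f}")
```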

Thank you.

sashkboos commented 7 months ago

@ChenMnZ

Please check the results in the paper (Tables 1 and 5). 6.10 is the case where we have A4W4 with a 4-bit KV cache (Table 1), whereas 6.99 is A16W4 with an FP16 KV cache (Table 5). Rotation reduces that number to 6.76, as stated in the paper.

ChenMnZ commented 7 months ago

@sashkboos

I know the reported results. I find that --w_clip brings a performance degradation for RTN: simply removing --w_clip improves the RTN perplexity from 6.99 to 6.11.

Specifically, some of my reproduced A16W4 results are as follows:

  • w_clip + rotate + rtn: 6.76 (same as the paper)
  • rtn + w_clip: 6.99 (same as the paper)
  • rtn: 6.11 (removing w_clip improves the RTN performance)
  • rtn + rotate: 9.52 (why does rotation damage the performance of RTN without w_clip?)

So, what is the potential reason that rtn + rotate achieves worse results than rtn when w_clip is not used?
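For context, my understanding of --w_clip is that it runs a per-channel search over clipping ratios and keeps the scale that minimizes the quantization error before rounding. The snippet below is only a minimal sketch of that idea, not the repo's actual implementation, and the grid size and shrink range are arbitrary:

```python
import numpy as np

def rtn(W, scale, bits=4):
    """Symmetric round-to-nearest quantize/dequantize with a given per-channel scale."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def clip_search_scale(W, bits=4, grid=20, max_shrink=0.8):
    """Per-output-channel grid search over clipping ratios, keeping the scale
    with the smallest squared quantization error (illustrative values only)."""
    qmax = 2 ** (bits - 1) - 1
    absmax = np.abs(W).max(axis=1, keepdims=True)
    best_scale = absmax / qmax
    best_err = ((W - rtn(W, best_scale, bits)) ** 2).sum(axis=1, keepdims=True)
    for i in range(1, grid + 1):
        ratio = 1.0 - max_shrink * i / grid      # shrink the clipping point from 100% down to 20% of absmax
        scale = ratio * absmax / qmax
        err = ((W - rtn(W, scale, bits)) ** 2).sum(axis=1, keepdims=True)
        better = err < best_err
        best_err = np.where(better, err, best_err)
        best_scale = np.where(better, scale, best_scale)
    return best_scale

# usage: W_quantized = rtn(W, clip_search_scale(W))
```

One possible reading of my numbers is that without rotation the MSE-optimal clip removes genuinely important outlier weights, which would explain why plain rtn beats rtn + w_clip here, but that is only a guess.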

brisker commented 3 months ago

@ChenMnZ @sashkboos I also encountered the same issue. It seems that the Hadamard transform makes the weights harder to quantize when quantizing them directly, with no further tricks applied.

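To make the observation concrete on purely synthetic weights, here is a toy numpy sketch (not the QuaRot code, numbers are illustrative only). For weights without strong outliers, a Hadamard rotation makes each row look roughly Gaussian, which inflates the per-row maximum relative to the bulk; a max-based RTN scale without clipping then wastes quantization levels on rare tail values, and a clipping search recovers part of the loss:

```python
import numpy as np

def rtn_err(W, bits=4, clip_ratio=1.0):
    """Relative error of symmetric per-output-channel RTN with an optional fixed clip ratio."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip_ratio * np.abs(W).max(axis=1, keepdims=True) / qmax
    Wq = np.clip(np.round(W / scale), -qmax - 1, qmax) * scale
    return np.linalg.norm(W - Wq) / np.linalg.norm(W)

def hadamard(n):
    """Normalized Hadamard matrix via Sylvester's construction (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
d = 1024
W = rng.uniform(-1, 1, size=(d, d))   # light-tailed weights, no outliers
W_rot = W @ hadamard(d)               # incoherence-processing-style rotation

print("plain rtn:              ", rtn_err(W))
print("rotated rtn:            ", rtn_err(W_rot))
print("rotated rtn + 0.7 clip: ", rtn_err(W_rot, clip_ratio=0.7))  # 0.7 is an arbitrary illustrative clip
```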