spcl / QuaRot

Code for the NeurIPS 2024 paper QuaRot: end-to-end 4-bit inference of large language models.
https://arxiv.org/abs/2404.00456
Apache License 2.0

Questions about reproduction of weight-only quantization. #3

Closed: ChenMnZ closed this issue 7 months ago

ChenMnZ commented 7 months ago

Dear Authors,

Thanks for your outstanding work. I like it and have learned a lot from it!

I tried to reproduce the weight-only quantization results in Table 5, but I obtained some results that are inconsistent with your paper.

For example,

I want to know if I am missing some details. Thank you.

ChenMnZ commented 7 months ago

Adding --w_clip solved my problem.

Amazing work and excellent performance!

sashkboos commented 7 months ago

Thanks @ChenMnZ for using our code. Glad to hear that you fixed the issue. I will close this issue.

ChenMnZ commented 7 months ago

Hi,

As I mentioned before, for A16W4 quantization without w_clip, plain RTN obtains 6.11 WikiText2 perplexity, while rotate + RTN obtains a worse result of 6.99 WikiText2 perplexity.

In my understanding, after rotation (incoherence processing), the distribution of the weights should be more uniform and the weights should become easier to quantize. So what is the potential reason that RTN + rotate achieves worse results than plain RTN?
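For reference, here is how I think about it on synthetic weights. This is only a toy numpy sketch, not the QuaRot code, and the numbers are illustrative: with max-based per-channel RTN, rotating a weight matrix that has a few outlier input channels usually reduces the quantization error, which is why the degradation I measure surprises me.

```python
import numpy as np

def rtn_quantize(W, bits=4):
    """Symmetric per-output-channel round-to-nearest with a max-based scale (no clipping)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def hadamard(n):
    """Normalized Hadamard matrix via Sylvester's construction (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
d = 256
W = rng.normal(0, 0.02, size=(d, d))
W[:, :4] += rng.normal(0, 0.5, size=(d, 4))   # a few outlier input channels

W_rot = W @ hadamard(d)                       # rotate along the input dimension

for name, M in [("plain", W), ("rotated", W_rot)]:
    err = np.linalg.norm(M - rtn_quantize(M)) / np.linalg.norm(M)
    print(f"{name} relative RTN error: {err:.4f}")
```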

Thank you.

sashkboos commented 7 months ago

@ChenMnZ

Please check the results in the paper (Tables 1 and 5). 6.10 is the case where we have A4W4 with a 4-bit KV cache (Table 1), whereas 6.99 is A16W4 with an FP16 KV cache (Table 5). Rotation reduces that number to 6.76, as stated in the paper.

ChenMnZ commented 7 months ago

@sashkboos

I know the reported results. I find that --w_clip brings a performance degradation for RTN: simply removing --w_clip improves the RTN perplexity from 6.99 to 6.11.

Specifically, some of my reproduced A16W4 results are as follows:

  • w_clip + rotate + rtn: 6.76 (same as the paper)
  • rtn + w_clip: 6.99 (same as the paper)
  • rtn: 6.11 (removing w_clip improves the RTN performance)
  • rtn + rotate: 9.52 (why does rotation damage the performance of RTN without w_clip?)

So, what is the potential reason that rtn + rotate achieves worse results than rtn when w_clip is not used?
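For context, my understanding of --w_clip is that it runs a per-channel search over clipping ratios and keeps the scale that minimizes the quantization error before rounding. The snippet below is only a minimal sketch of that idea, not the repo's actual implementation, and the grid size and shrink range are arbitrary:

```python
import numpy as np

def rtn(W, scale, bits=4):
    """Symmetric round-to-nearest quantize/dequantize with a given per-channel scale."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def clip_search_scale(W, bits=4, grid=20, max_shrink=0.8):
    """Per-output-channel grid search over clipping ratios, keeping the scale
    with the smallest squared quantization error (illustrative values only)."""
    qmax = 2 ** (bits - 1) - 1
    absmax = np.abs(W).max(axis=1, keepdims=True)
    best_scale = absmax / qmax
    best_err = ((W - rtn(W, best_scale, bits)) ** 2).sum(axis=1, keepdims=True)
    for i in range(1, grid + 1):
        ratio = 1.0 - max_shrink * i / grid      # shrink the clipping point from 100% down to 20% of absmax
        scale = ratio * absmax / qmax
        err = ((W - rtn(W, scale, bits)) ** 2).sum(axis=1, keepdims=True)
        better = err < best_err
        best_err = np.where(better, err, best_err)
        best_scale = np.where(better, scale, best_scale)
    return best_scale

# usage: W_quantized = rtn(W, clip_search_scale(W))
```

One possible reading of my numbers is that without rotation the MSE-optimal clip removes genuinely important outlier weights, which would explain why plain rtn beats rtn + w_clip here, but that is only a guess.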

brisker commented 3 months ago

@ChenMnZ @sashkboos I also encountered the same issue. It seems that the Hadamard transform makes the weights harder to quantize when quantizing them directly, with no further tricks applied.

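To make the observation concrete on purely synthetic weights, here is a toy numpy sketch (not the QuaRot code, numbers are illustrative only). For weights without strong outliers, a Hadamard rotation makes each row look roughly Gaussian, which inflates the per-row maximum relative to the bulk; a max-based RTN scale without clipping then wastes quantization levels on rare tail values, and a clipping search recovers part of the loss:

```python
import numpy as np

def rtn_err(W, bits=4, clip_ratio=1.0):
    """Relative error of symmetric per-output-channel RTN with an optional fixed clip ratio."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip_ratio * np.abs(W).max(axis=1, keepdims=True) / qmax
    Wq = np.clip(np.round(W / scale), -qmax - 1, qmax) * scale
    return np.linalg.norm(W - Wq) / np.linalg.norm(W)

def hadamard(n):
    """Normalized Hadamard matrix via Sylvester's construction (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
d = 1024
W = rng.uniform(-1, 1, size=(d, d))   # light-tailed weights, no outliers
W_rot = W @ hadamard(d)               # incoherence-processing-style rotation

print("plain rtn:              ", rtn_err(W))
print("rotated rtn:            ", rtn_err(W_rot))
print("rotated rtn + 0.7 clip: ", rtn_err(W_rot, clip_ratio=0.7))  # 0.7 is an arbitrary illustrative clip
```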