opengear-project / GEAR

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

Question about storage overhead of sparse matrix S #2

Closed ThisisBillhe closed 4 months ago

ThisisBillhe commented 5 months ago

Hi there! Thanks for your excellent work and the open-sourced code.

In the Outlier-reduced quantization section of your paper, you mention that "such a sparse matrix results in the remaining cache size equivalent to that of 8-bit quantization because of its two index vectors and one value vector in full precision". If I understand correctly, storing S can be a heavy burden because there is a large portion of outliers. In this case, why do you store S and further introduce a low-rank matrix L to approximate the residual R = X - (D + S)? Using a low-rank matrix to approximate S makes more sense to me.
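
To make the concern concrete, here is a rough back-of-the-envelope sketch of the storage; the 10% outlier fraction and the fp16/int16 dtypes are my own assumptions, not numbers from the paper:

```python
# Back-of-the-envelope storage comparison for a KV cache tensor X of shape (n, d):
# 4-bit quantization alone vs. 4-bit quantization plus a sparse outlier matrix S
# kept as (row index, column index, value) triplets. The 10% outlier fraction and
# the fp16/int16 dtypes are illustrative assumptions, not numbers from the paper.

def kv_cache_bytes(n, d, outlier_frac=0.10):
    full_fp16  = n * d * 2        # original fp16 cache
    dense_4bit = n * d // 2       # 4-bit quantized dense part D
    dense_8bit = n * d            # 8-bit quantized baseline
    # each outlier keeps one fp16 value plus two int16 indices -> 6 bytes
    sparse_S   = int(outlier_frac * n * d) * (2 + 2 + 2)
    return full_fp16, dense_4bit, dense_8bit, dense_4bit + sparse_S

fp16, q4, q8, q4_plus_S = kv_cache_bytes(4096, 128)
print(fp16, q4, q8, q4_plus_S)
# -> 1048576 262144 524288 576712: with ~10% outliers, "4-bit + S" is already
#    in the same ballpark as the 8-bit cache, which is the overhead I mean above.
```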

I understand that your method (X = D + L + S) can achieve better performance; it's just that the motivation of the Low-rank approximation section is a little confusing.

HaoKang-Timmy commented 5 months ago

S holds the values and indices of the corresponding outlier entries, so you cannot use a low-rank approximation for it. Also, we show in the paper that outlier extraction alone does not work well for activations. If you look at the paper carefully, you will find that at very high compression ratios, low-rank approximation performs best compared with sparsity and quantization. In addition, the quantization error has a fast-decaying eigenvalue spectrum, which shows that it can be approximated well by a low-rank matrix.
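
If you want to check this yourself, here is a minimal way to inspect the singular-value spectrum of the quantization residual; the uniform 4-bit quantizer and the rank are illustrative choices rather than our exact scheme, and X is just a placeholder for a real K/V cache tensor (random data will not show the decay that real activations do):

```python
import torch

# Inspect how fast the singular values of the quantization residual R decay;
# a fast decay is what makes a small rank-r term L effective.
# The uniform 4-bit quantizer below is illustrative, not GEAR's exact scheme,
# and X should be a real K/V cache tensor (random data gives a flat spectrum).
X = torch.randn(4096, 128)                       # placeholder for a real K/V tensor

def quant_dequant_4bit(x):
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / 15                       # 16 uniform levels
    return torch.clamp(torch.round((x - lo) / scale), 0, 15) * scale + lo

R = X - quant_dequant_4bit(X)                    # quantization residual
U, sigma, Vh = torch.linalg.svd(R, full_matrices=False)
print(sigma[:8] / sigma[0])                      # leading part of the spectrum

r = 4                                            # illustrative rank
L = (U[:, :r] * sigma[:r]) @ Vh[:r, :]           # rank-r approximation of R
print(((R - L).norm() / R.norm()).item())        # error left after adding L back
```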

ThisisBillhe commented 5 months ago

Hi Hao Kang, it is a pleasure to see alumni from ZJU here.

  1. Referring to Algorithm 1 in the paper, you do use Filter_s to extract the outliers S and introduce a low-rank matrix L to approximate the residual R = X - (D + S), right? And the "compress_mode" for TrueCompressFunction should be "gear_batch" (see the sketch after this list).
  2. My initial question is that you said "such a sparse matrix results in the remaining cache size equivalent to that of 8-bit quantization because of its two index vectors and one value vector in full precision" in the paper, which means the storage of S can be a heavy burden. However, you still store S (outlier extraction) and further introduce a low-rank approximation. It is not a big deal, but it is a little confusing.
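
Just to make sure I read Algorithm 1 correctly, here is the decomposition as I understand it; the filter threshold, the quantizer, and the rank are placeholders on my side, not the repo's actual TrueCompressFunction / gear_batch code:

```python
import torch

# My reading of the X = D + S + L decomposition in Algorithm 1
# (Filter_s threshold, quantizer, and rank are placeholders, not the repo's code).

def filter_s(x, s=0.02):
    # keep the top-s fraction of entries by magnitude as the sparse outlier part S
    k = max(1, int(s * x.numel()))
    thresh = x.abs().flatten().kthvalue(x.numel() - k + 1).values
    return torch.where(x.abs() >= thresh, x, torch.zeros_like(x))

def quant_dequant_4bit(x):
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / 15
    return torch.clamp(torch.round((x - lo) / scale), 0, 15) * scale + lo

def low_rank(x, r=4):
    U, sigma, Vh = torch.linalg.svd(x, full_matrices=False)
    return (U[:, :r] * sigma[:r]) @ Vh[:r, :]

X = torch.randn(4096, 128)          # stand-in for a K or V cache tensor
S = filter_s(X)                     # sparse outliers
D = quant_dequant_4bit(X - S)       # quantized dense part
L = low_rank(X - D - S)             # low-rank approximation of the residual R
print(((X - (D + S + L)).norm() / X.norm()).item())
```
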
HaoKang-Timmy commented 5 months ago

Hi, nice to meet you, too!

  1. Yes, compress_mode should be gear_batch or gear_tokenwiseQ. Note that GEAR is a compression framework: with a proper quantization scheme, we can achieve results comparable to outlier-reduced quantization. We have provided a time breakdown analysis and found that quantization is the most time-consuming part of our algorithm. The reason is that PyTorch currently does not support a 4-bit data type, so we need to squeeze two 4-bit values into one 8-bit value, which causes extra computation, and we currently do not have a dedicated quantization kernel (see the packing sketch after this list). Still, on low-memory systems you can see real throughput gains.
  2. The reason is that quantization and sparsity have coupling effects: quantization + low-rank approximation will not work if we choose uniform quantization as the quantization scheme. Please refer to Table 5 in the paper for more details.
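
To illustrate the packing overhead mentioned in point 1, here is a minimal sketch of squeezing two 4-bit codes into one uint8; it is illustrative, not the code used in this repo:

```python
import torch

# Minimal sketch of packing two 4-bit codes into one uint8, since PyTorch has
# no native 4-bit dtype; illustrative only, not the repo's implementation.
def pack_4bit(q):
    # q: uint8 tensor of 4-bit codes (values 0..15), even number of elements
    q = q.view(-1, 2)
    return (q[:, 0] | (q[:, 1] << 4)).to(torch.uint8)

def unpack_4bit(packed):
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    return torch.stack([lo, hi], dim=1).view(-1)

codes = torch.randint(0, 16, (8,), dtype=torch.uint8)
packed = pack_4bit(codes)               # half the bytes of `codes`
assert torch.equal(unpack_4bit(packed), codes)
```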