opengear-project / GEAR

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

Question about LowRank #11

Open shhn1 opened 4 months ago

shhn1 commented 4 months ago

Thanks for your great work!

In the paper, after the KV cache is quantized, a low-rank matrix is used to approximate the quantization error. I would really like to know whether this process requires training. Since I can't find a usage guide, could you tell me where the specific usage details are in the code?

HaoKang-Timmy commented 4 months ago

This does not need any training or fine-tuning. The low-rank matrices are generated with a singular value decomposition (SVD) algorithm.
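
For intuition, here is a minimal sketch of the idea, assuming simulated uniform per-tensor quantization and PyTorch's `torch.svd_lowrank`. The function name, shapes, and quantization scheme are illustrative, not GEAR's actual API:

```python
import torch

def quantize_with_lowrank_residual(kv: torch.Tensor, bits: int = 4, rank: int = 4):
    """Illustrative only -- not GEAR's actual code. Uniform per-tensor
    quantization plus a rank-`rank` SVD approximation of the error."""
    # Simulated uniform quantization of a KV slice.
    scale = (kv.max() - kv.min()) / (2**bits - 1)
    zero = kv.min()
    codes = torch.round((kv - zero) / scale)
    dequant = codes * scale + zero
    # The quantization error is factorized with randomized SVD;
    # this is pure linear algebra, so no training is needed.
    residual = kv - dequant
    U, S, V = torch.svd_lowrank(residual, q=rank)
    return codes, scale, zero, U * S, V  # L = U * S, R = V

kv = torch.randn(128, 64)  # toy cache slice: 128 tokens, head_dim 64
codes, scale, zero, L, R = quantize_with_lowrank_residual(kv)
approx = codes * scale + zero + L @ R.T  # dequantize, then add error term
print((kv - approx).abs().mean())        # below the raw quantization error
```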

shhn1 commented 4 months ago

Thanks for your kind reply! I only recently started learning about quantization. May I ask whether the quantization error is obtained using an offline calibration dataset? And when running inference, is a fixed low-rank approximation matrix added to the quantized KV cache?

In addition, I saw that there are low-rank-related functions in this script ([fake_svd_lowrank](https://github.com/opengear-project/GEAR/blob/b4f14ce6678240a2e7f828d3c4a268d719b5ee7d/GEARLM/GEARLM/Simulated/compress_function.py#L202)), but I did not find them being used in the llama-related code. Could you tell me how they are used?

I would be very grateful if you could reply! :)


HaoKang-Timmy commented 3 months ago


Quantization error is calculated during the quantization process itself, not from an offline calibration dataset.
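
Read together with the earlier answer, this suggests the low-rank factors come from the live cache rather than from a precomputed matrix. A hedged, self-contained toy demonstration (reusing the illustrative sketch above; none of these names are GEAR's API):

```python
import torch

def quantize_with_lowrank_residual(kv, bits=4, rank=4):
    # Same toy sketch as in the earlier comment (illustrative, not GEAR's API).
    scale = (kv.max() - kv.min()) / (2**bits - 1)
    zero = kv.min()
    codes = torch.round((kv - zero) / scale)
    U, S, V = torch.svd_lowrank(kv - (codes * scale + zero), q=rank)
    return codes, scale, zero, U * S, V

# Two different inputs yield two different low-rank error factors:
# each is computed online from its own cache contents, so there is
# no fixed matrix learned from a calibration dataset.
cache_a, cache_b = torch.randn(64, 64), torch.randn(64, 64)
*_, L_a, _ = quantize_with_lowrank_residual(cache_a)
*_, L_b, _ = quantize_with_lowrank_residual(cache_b)
print(torch.allclose(L_a, L_b))  # False: the factors are input-dependent
```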