brisker opened 11 months ago
The appendix of the paper states that "quantization of activation can also help reduce the memory cost from storing the KV cache". Referring to the explanation in Figure 6 about applying SmoothQuant to the attention block, my understanding is that the quantization scheme turns the X·W_K operation into a multiplication of two INT8 matrices, so the resulting K is also in INT8 format. Compared with storing K in FP16, this lowers memory usage and thereby reduces the memory overhead of the KV cache.
The above is my personal understanding, and I am not sure whether it is correct.
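To make the memory argument concrete, here is a minimal sketch of what I have in mind, using hypothetical shapes (seq_len=2048, hidden=4096) and a simple per-tensor symmetric quantizer; this is only an illustration of the storage comparison, not the actual SmoothQuant implementation:

```python
import torch

# Hypothetical shapes for illustration only: one layer, seq_len=2048, hidden=4096.
seq_len, hidden = 2048, 4096

x = torch.randn(seq_len, hidden)
w_k = torch.randn(hidden, hidden)

# Baseline: K kept in FP16 in the KV cache.
k_fp16 = (x @ w_k).to(torch.float16)

# Simulated W8A8 path: per-tensor symmetric INT8 quantization of K, so the
# cached K could be stored as INT8 plus a single FP scale factor.
scale = k_fp16.float().abs().max() / 127.0
k_int8 = torch.clamp((k_fp16.float() / scale).round(), -128, 127).to(torch.int8)

mib = lambda t: t.numel() * t.element_size() / 2**20
print(f"FP16 K cache: {mib(k_fp16):.1f} MiB")  # ~16 MiB
print(f"INT8 K cache: {mib(k_int8):.1f} MiB")  # ~8 MiB, i.e. half the footprint
```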
Also, is the KV cache actually unused in typical LLM evaluation tasks? Those tasks usually take only a single attention pass over the input, unlike autoregressive generation, which relies heavily on the KV cache because tokens are produced one by one.
If this is true, how should we evaluate quantization performance when the KV cache needs to be quantized, given that the KV cache is not exercised in the usual evaluation tasks?
Besides, how is the KV cache quantized in SmoothQuant?
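For reference, here is the kind of workaround I imagine (not necessarily what SmoothQuant actually does): fake-quantizing K and V inside a single forward pass, so the quantization error is still visible to one-step metrics such as perplexity. The function names and the per-tensor symmetric scheme below are my own assumptions, only meant to make the question concrete:

```python
import torch

def fake_quant_per_tensor(t: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Quantize-then-dequantize with a symmetric per-tensor scale, simulating
    the error of an INT8 tensor while keeping the math in floating point."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = t.abs().max().clamp(min=1e-8) / q_max
    return (t / scale).round().clamp(-q_max - 1, q_max) * scale

def attention_with_quantized_kv(q, k, v):
    # Injecting the quantization error into K and V during one forward pass
    # over the full prompt exposes the same error that INT8 KV-cache entries
    # would introduce during token-by-token generation, so standard one-step
    # evaluation can still reflect it.
    k = fake_quant_per_tensor(k)
    v = fake_quant_per_tensor(v)
    attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```

Is this the right way to think about evaluating KV-cache quantization, or does the released code handle it differently?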
Hoping to discuss this with the authors of SmoothQuant!