Open shatealaboxiaowang opened 11 months ago
Hi, thank you for open-sourcing this project. I have a few questions about inference with quantized models. (1) For a model with only W8A8 quantization, where the KV cache is not quantized, does the forward pass have to dequantize back to floating point before the attention computation can proceed, and will this increase the inference time? (2) How can we do all of the computation on INT8 tensor cores, which can save memory and improve computation efficiency?
Thank you for your interest in our open-source project and for your questions about inference with quantized models.
Regarding your first question: yes, we do quantize the KV cache in our model. This quantization significantly decreases inference time; you can refer to the relevant benchmark details in our paper.
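As a rough illustration of what KV cache quantization involves, here is a minimal sketch of symmetric per-tensor INT8 quantization of a cached K tensor. The helper names are hypothetical and not taken from the SmoothQuant codebase:

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor INT8 quantization (hypothetical helper,
    not the SmoothQuant implementation)."""
    scale = x.abs().max().float().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an fp16 approximation from the INT8 representation."""
    return (q.float() * scale).half()

# Keeping K and V in INT8 halves the cache footprint versus fp16,
# cutting the memory traffic the attention kernels read per token.
k = torch.randn(1, 32, 1024, 128, dtype=torch.float16)  # illustrative shape
k_q, k_scale = quantize_int8(k)
print(k_q.element_size(), "byte/element vs", k.element_size())  # 1 vs 2
```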
For your second question, performing all computations with INT8 tensor cores does indeed save memory and improve computation efficiency, and we have detailed this process in our implementation. You can find a practical demonstration in our notebook here: SmoothQuant INT8 Demo.
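As a rough sketch of that W8A8 dataflow (the integer matmul below merely stands in for a real INT8 tensor-core GEMM, such as the CUTLASS-based kernels wrapped by torch-int; all names here are illustrative, not the actual API):

```python
import torch

def w8a8_linear(x_fp16: torch.Tensor, w_q: torch.Tensor, w_scale: torch.Tensor):
    """Sketch of a W8A8 linear layer: quantize the activation on the
    fly, multiply in integer arithmetic, rescale back to fp16."""
    # Dynamic symmetric per-tensor activation quantization.
    x_scale = x_fp16.abs().max().float().clamp(min=1e-8) / 127.0
    x_q = torch.clamp(torch.round(x_fp16 / x_scale), -128, 127).to(torch.int8)

    # INT8 x INT8 -> INT32 accumulation; a real kernel would run this
    # on tensor cores instead of widening to int32 in PyTorch.
    acc = torch.matmul(x_q.to(torch.int32), w_q.to(torch.int32).t())

    # A single fused rescale returns the result to fp16.
    return (acc.float() * (x_scale * w_scale)).half()

# Weights are quantized once, offline.
w = torch.randn(512, 256, dtype=torch.float16)
w_scale = w.abs().max().float() / 127.0
w_q = torch.clamp(torch.round(w / w_scale), -128, 127).to(torch.int8)

x = torch.randn(8, 256, dtype=torch.float16)
y = w8a8_linear(x, w_q, w_scale)
print(y.shape, y.dtype)  # torch.Size([8, 512]) torch.float16
```

The key point of the design is that the weight scale is computed once offline while the activation scale is computed on the fly, so the only floating-point work surrounding the INT8 GEMM is a single rescale of the INT32 accumulator.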
Best, Guangxuan