Open shatealaboxiaowang opened 11 months ago
Hi, thank you for open-sourcing this project. I have a few questions about inference with quantized models. (1) For a model with only W8A8 quantization, where the KV cache is not quantized, does the forward pass have to dequantize back to floating point before the attention computation can proceed, and will this increase the inference time? (2) How can we do all of the computation on INT8 tensor cores, which can save memory and improve computation efficiency?
Thank you for your interest in our open-source project and for your questions about inference with quantized models.
Regarding your first question: yes, we do quantize the KV cache in our model. This quantization significantly decreases inference time; you can refer to the relevant benchmark details in our paper.
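As a rough illustration of what KV cache quantization involves, here is a minimal sketch of symmetric per-tensor INT8 quantization of a cached K tensor. The helper names are hypothetical and not taken from the SmoothQuant codebase:

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor INT8 quantization (hypothetical helper,
    not the SmoothQuant implementation)."""
    scale = x.abs().max().float().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an fp16 approximation from the INT8 representation."""
    return (q.float() * scale).half()

# Keeping K and V in INT8 halves the cache footprint versus fp16,
# cutting the memory traffic the attention kernels read per token.
k = torch.randn(1, 32, 1024, 128, dtype=torch.float16)  # illustrative shape
k_q, k_scale = quantize_int8(k)
print(k_q.element_size(), "byte/element vs", k.element_size())  # 1 vs 2
```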
For your second question, performing all computations with INT8 tensor cores does indeed save memory and improve computation efficiency, and we have detailed this process in our implementation. You can find a practical demonstration in our notebook here: SmoothQuant INT8 Demo.
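As a rough sketch of that W8A8 dataflow (the integer matmul below merely stands in for a real INT8 tensor-core GEMM, such as the CUTLASS-based kernels wrapped by torch-int; all names here are illustrative, not the actual API):

```python
import torch

def w8a8_linear(x_fp16: torch.Tensor, w_q: torch.Tensor, w_scale: torch.Tensor):
    """Sketch of a W8A8 linear layer: quantize the activation on the
    fly, multiply in integer arithmetic, rescale back to fp16."""
    # Dynamic symmetric per-tensor activation quantization.
    x_scale = x_fp16.abs().max().float().clamp(min=1e-8) / 127.0
    x_q = torch.clamp(torch.round(x_fp16 / x_scale), -128, 127).to(torch.int8)

    # INT8 x INT8 -> INT32 accumulation; a real kernel would run this
    # on tensor cores instead of widening to int32 in PyTorch.
    acc = torch.matmul(x_q.to(torch.int32), w_q.to(torch.int32).t())

    # A single fused rescale returns the result to fp16.
    return (acc.float() * (x_scale * w_scale)).half()

# Weights are quantized once, offline.
w = torch.randn(512, 256, dtype=torch.float16)
w_scale = w.abs().max().float() / 127.0
w_q = torch.clamp(torch.round(w / w_scale), -128, 127).to(torch.int8)

x = torch.randn(8, 256, dtype=torch.float16)
y = w8a8_linear(x, w_q, w_scale)
print(y.shape, y.dtype)  # torch.Size([8, 512]) torch.float16
```

The key point of the design is that the weight scale is computed once offline while the activation scale is computed on the fly, so the only floating-point work surrounding the INT8 GEMM is a single rescale of the INT32 accumulator.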
Best, Guangxuan