mit-han-lab / smoothquant

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
https://arxiv.org/abs/2211.10438
MIT License

W8A8: does it require dequantization during forward inference? #70

Open shatealaboxiaowang opened 11 months ago

shatealaboxiaowang commented 11 months ago

Hi, thank you for the open-source release. I have a few questions about inference with the quantized models. (1) For a model with only W8A8 quantization, where the KV cache is not quantized, does the forward pass have to dequantize activations back to floating point before the rest of the computation can proceed, and does that increase inference time? (2) How can all of the computation be done on INT8 Tensor Cores, which would save memory and improve compute efficiency?
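
For concreteness, here is a minimal sketch (plain PyTorch with hypothetical helper names, assuming symmetric per-tensor INT8 quantization, not SmoothQuant's actual kernels) of the flow I am asking about: the W8A8 linear layer dequantizes its INT32 accumulator back to FP16 because the next op (attention over an unquantized KV cache) runs in FP16.

```python
import torch

def quantize_per_tensor(x: torch.Tensor):
    # Symmetric per-tensor INT8 quantization: returns the INT8 tensor and its scale.
    scale = x.abs().max().float() / 127.0
    q = torch.clamp(torch.round(x.float() / scale), -128, 127).to(torch.int8)
    return q, scale

def w8a8_linear_then_fp16(x_fp16, w_q, w_scale):
    # INT8 x INT8 -> INT32 accumulation (emulated with an int32 matmul on CPU),
    # then the accumulator is rescaled (dequantized) back to FP16 because the
    # following attention / KV-cache computation is not quantized.
    x_q, x_scale = quantize_per_tensor(x_fp16)
    acc = x_q.to(torch.int32) @ w_q.to(torch.int32).T
    return (acc.float() * (x_scale * w_scale)).to(torch.float16)

# Toy usage: the FP16 output would feed the FP16 attention / KV cache.
w_q, w_scale = quantize_per_tensor(torch.randn(64, 128))
x = torch.randn(4, 128, dtype=torch.float16)
y = w8a8_linear_then_fp16(x, w_q, w_scale)
```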

Guangxuan-Xiao commented 11 months ago

Thank you for your interest in our open-source project and for your questions about inference with quantized models.

  1. Regarding your first question: yes, we do quantize the KV cache in our model. This quantization significantly decreases inference time, as benchmarked in our paper; you can refer to the relevant benchmark details there.

  2. For your second question, on how to perform all computations on INT8 Tensor Cores, which indeed saves memory and improves computation efficiency: we have detailed this process in our implementation, and you can find a practical demonstration in our notebook here: SmoothQuant INT8 Demo. A rough sketch of the idea follows below.
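
For illustration only, here is a minimal sketch (a plain PyTorch emulation with hypothetical names and made-up scales, not the CUTLASS-based INT8 kernels used in the demo) of how a W8A8 linear with static, calibration-derived scales can keep its output in INT8, so consecutive layers exchange INT8 activations and each GEMM can run on INT8 Tensor Cores without dequantizing to FP16 in between.

```python
import torch

class StaticW8A8Linear(torch.nn.Module):
    def __init__(self, w_q: torch.Tensor, w_scale: float,
                 in_scale: float, out_scale: float):
        super().__init__()
        self.register_buffer("w_q", w_q)   # INT8 weight, shape [out, in]
        self.w_scale = w_scale             # per-tensor weight scale
        self.in_scale = in_scale           # static input activation scale
        self.out_scale = out_scale         # static output activation scale

    def forward(self, x_q: torch.Tensor) -> torch.Tensor:
        # INT8 x INT8 -> INT32 accumulation (emulated with an int32 matmul here;
        # on GPU this would be a single INT8 Tensor Core GEMM).
        acc = x_q.to(torch.int32) @ self.w_q.to(torch.int32).T
        # Fuse dequantization and requantization into one multiply: the INT32
        # accumulator is rescaled directly onto the next layer's INT8 grid.
        requant = self.in_scale * self.w_scale / self.out_scale
        y_q = torch.clamp(torch.round(acc.float() * requant), -128, 127)
        return y_q.to(torch.int8)          # stays INT8 for the next INT8 layer

# Toy usage with made-up scales (in practice, scales come from calibration).
w_q = torch.randint(-128, 128, (64, 128), dtype=torch.int8)
x_q = torch.randint(-128, 128, (4, 128), dtype=torch.int8)
layer = StaticW8A8Linear(w_q, w_scale=0.01, in_scale=0.02, out_scale=0.05)
y_q = layer(x_q)  # INT8 output, ready for the next INT8 layer
```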

Best, Guangxuan