Open dingjingzhen opened 5 months ago
You should read their GitHub blog post; it performs much better.
At the time we were writing the research paper, TensorRT-LLM's W4A16 kernel was faster than AWQ's W4A16 kernel, so we compared our kernel against TensorRT-LLM's W4A16 kernel. According to this figure, our FP6 kernel achieves performance similar to the fine-grained W4A16 kernel and is slightly slower than the coarse-grained W4A16 kernel. As for accuracy, coarse-grained W4A16 shows noticeably worse results. We also found that FP6 quantization is more robust than INT4. Please also refer to this paper for more insights on model accuracy.
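To make the robustness claim concrete, here is a rough sketch (not the paper's actual quantizer; the FP6 e3m2 bit layout, per-tensor scaling, and symmetric INT4 range are all assumptions for illustration) comparing round-trip RMS error of an FP6-style grid against uniform INT4 on Gaussian-distributed weights. The intuition is that FP6's floating-point grid keeps fine resolution near zero, where most weights live, while INT4's uniform grid spends levels evenly across the whole range:

```python
import numpy as np

def fp6_values(exp_bits=3, man_bits=2, bias=3):
    # Enumerate all representable FP6 (e3m2, assumed layout) magnitudes:
    # subnormals (exponent field 0) plus normals, then mirror for sign.
    vals = [m / 2**man_bits * 2.0**(1 - bias) for m in range(2**man_bits)]
    for e in range(1, 2**exp_bits):
        for m in range(2**man_bits):
            vals.append((1 + m / 2**man_bits) * 2.0**(e - bias))
    v = np.array(vals)
    return np.concatenate([-v[::-1], v])

def quantize_to_grid(x, grid):
    # Snap each element to its nearest representable grid value.
    idx = np.abs(x[:, None] - grid[None, :]).argmin(axis=1)
    return grid[idx]

def int4_symmetric(x):
    # Per-tensor symmetric INT4: 15 uniform levels scaled by the absmax.
    scale = np.abs(x).max() / 7
    return np.clip(np.round(x / scale), -7, 7) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 1, 4096)

grid = fp6_values()
scale = np.abs(w).max() / grid.max()   # map weight absmax to FP6 max value
w_fp6 = quantize_to_grid(w / scale, grid) * scale
w_int4 = int4_symmetric(w)

err_fp6 = np.sqrt(np.mean((w - w_fp6) ** 2))
err_int4 = np.sqrt(np.mean((w - w_int4) ** 2))
print(f"RMS error  FP6: {err_fp6:.4f}   INT4: {err_int4:.4f}")
```

On this synthetic bell-shaped data the FP6 grid gives a visibly lower RMS error than per-tensor INT4, which matches the direction of the robustness observation above; real accuracy differences of course depend on the model and the grouping scheme (fine- vs coarse-grained scales).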
Have the accuracy and performance been compared with AWQ INT4?