usyd-fsalab / fp6_llm

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
Apache License 2.0

Has the accuracy and performance been compared with AWQ? #4

Open dingjingzhen opened 5 months ago

dingjingzhen commented 5 months ago

Has the accuracy and performance been compared with AWQ INT4?

catid commented 4 months ago

You should read their GitHub blog post; it performs much better.

Summer-Summer commented 4 months ago

At the time we were writing the research paper, TensorRT-LLM's W4A16 kernel was faster than AWQ's W4A16 kernel, so we compared our kernel performance against TensorRT-LLM's W4A16 kernel. According to this figure, our FP6 kernel achieves performance similar to the fine-grained W4A16 kernel and is slightly slower than the coarse-grained W4A16 kernel. As for accuracy, coarse-grained W4A16 shows significant degradation. We also found that FP6 quantization is more robust than INT4. Please also refer to this paper for more insights on model accuracy.
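
For intuition on the robustness claim, here is a minimal round-trip sketch (not code from this repo) that compares quantization error for FP6 and symmetric per-tensor INT4 on a synthetic weight vector with a few outliers. It assumes an E3M2 FP6 layout with exponent bias 3 and no inf/NaN encodings; the grid construction, helper names, and test distribution are all illustrative assumptions, not the kernel's actual quantizer.

```python
# Illustrative sketch: round-trip error of FP6 (assumed E3M2, bias 3)
# vs. symmetric INT4, both with a single per-tensor scale.
import numpy as np

def fp6_grid(ebits=3, mbits=2, bias=3):
    """Enumerate all representable FP6 values (assumed layout, no inf/NaN)."""
    vals = set()
    for sign in (1.0, -1.0):
        for e in range(2 ** ebits):
            for m in range(2 ** mbits):
                frac = m / 2 ** mbits
                if e == 0:                        # subnormal range
                    v = sign * frac * 2.0 ** (1 - bias)
                else:                             # normal range
                    v = sign * (1.0 + frac) * 2.0 ** (e - bias)
                vals.add(v)
    return np.array(sorted(vals))

def quantize_to_grid(x, grid):
    """Round each element to the nearest grid point."""
    idx = np.abs(x[..., None] - grid).argmin(axis=-1)
    return grid[idx]

def fp6_roundtrip(x):
    grid = fp6_grid()
    scale = np.abs(x).max() / grid.max()          # map max |x| to max FP6 value
    return quantize_to_grid(x / scale, grid) * scale

def int4_roundtrip(x):
    scale = np.abs(x).max() / 7.0                 # symmetric INT4 levels: -7..7
    return np.clip(np.round(x / scale), -7, 7) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)              # weight-like distribution
w[:8] *= 20.0                                     # a few outlier weights

for name, fn in [("FP6", fp6_roundtrip), ("INT4", int4_roundtrip)]:
    rmse = np.sqrt(np.mean((w - fn(w)) ** 2))
    print(f"{name} round-trip RMSE: {rmse:.2e}")
```

Under these assumptions the effect is that the FP6 grid is non-uniform, so a few outliers inflate the scale without flattening the many small weights to zero, whereas uniform INT4 spends its levels evenly and loses most of its resolution near zero; this is one way to read the "FP6 is more robust than INT4" observation above.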