mit-han-lab / qserve

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Apache License 2.0

Question about dequantization overhead #23

Open DD-DuDa opened 2 months ago

DD-DuDa commented 2 months ago

Thanks for your great work. I want to learn how the dequantization overhead is calculated, as in Figure 18, since the dequantization process happens inside a single kernel.

ys-2020 commented 2 months ago

Hi @DD-DuDa , thank you very much for your interest in QServe. We directly measured the dequantization overheads of the kernels above: we compared the actual throughput of GEMM kernels with dequantization against variants in which the dequantization ops are skipped. The throughput difference between the two versions of the kernel is taken as the dequantization overhead.
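The arithmetic behind this measurement can be sketched as follows. This is a minimal illustration of the methodology described above, not QServe code; the function name and example numbers are my own. Since throughput is inversely proportional to kernel time, the fraction of time spent on dequantization follows directly from the two measured throughputs:

```python
def dequant_overhead(tput_with_dequant: float, tput_no_dequant: float) -> float:
    """Fraction of kernel runtime attributable to dequantization.

    Both arguments are measured throughputs (e.g. TFLOPS) of the same GEMM
    kernel: once with the dequant ops included, once with them skipped.
    Runtime is inversely proportional to throughput, so the time overhead is
    1 - t_skip / t_full = 1 - tput_with / tput_without.
    """
    return 1.0 - tput_with_dequant / tput_no_dequant

# Hypothetical numbers: 260 TFLOPS with dequant, 325 TFLOPS with it skipped
# -> dequantization accounts for 20% of the kernel time.
print(f"{dequant_overhead(260.0, 325.0):.0%}")
```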

DD-DuDa commented 2 months ago

Got it! Thank you for your response!

brisker commented 1 month ago

@ys-2020 In formula (5) of your paper, why is the per-group scale uint8? How can a uint4 minus a uint4, multiplied by a uint8, still be sint8? Is that a typo? This is quite confusing. (In my understanding, the per-group scale should also be 4-bit in order to produce an sint8 weight.)
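The confusion can be made concrete with a quick worst-case range check. The sketch below is my own arithmetic, not taken from the paper; it only shows that without additional constraints on the quantized values (which the paper's progressive quantization scheme presumably imposes so that the result fits in sint8), the product overflows sint8 for both a uint8 and a uint4 per-group scale:

```python
# Worst-case ranges for second-level dequantization of the form s * (W_u4 - z_u4),
# assuming unconstrained uint4 weights/zero-points. Illustrative only.
U4_MAX = 15            # max of an unsigned 4-bit value
U8_MAX = 255           # max of an unsigned 8-bit value
S8_MIN, S8_MAX = -128, 127

# (W_u4 - z_u4) ranges over [-15, 15]
diff_min, diff_max = 0 - U4_MAX, U4_MAX - 0

# With a uint8 per-group scale, the unconstrained product spans [-3825, 3825]:
prod_u8 = (diff_min * U8_MAX, diff_max * U8_MAX)

# Even with a uint4 scale, the unconstrained product spans [-225, 225]:
prod_u4 = (diff_min * U4_MAX, diff_max * U4_MAX)

def fits_sint8(lo: int, hi: int) -> bool:
    return S8_MIN <= lo and hi <= S8_MAX

print(prod_u8, fits_sint8(*prod_u8))  # (-3825, 3825) False
print(prod_u4, fits_sint8(*prod_u4))  # (-225, 225) False
```

So the sint8 result in formula (5) cannot follow from the bit widths alone; it must rely on the value ranges being restricted, which is presumably what the question is asking the authors to clarify.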