triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Is smooth quant slower than int8 weight only quant? #121

Open shiqingzhangCSU opened 11 months ago

shiqingzhangCSU commented 11 months ago

I tested two quantization methods on a 3B model: W8A8 SmoothQuant and INT8 weight-only quantization. The following is the efficiency of the different optimization methods. I'm a little confused: is INT8 weight-only faster than SmoothQuant? Or maybe I have a bug in my code? [image: efficiency of the different optimization methods]

byshiue commented 11 months ago

What batch size do you use?

shiqingzhangCSU commented 11 months ago

I tested with batch size = 1.

byshiue commented 11 months ago

For small batch sizes, INT8 weight-only is expected to be faster than SmoothQuant. So, your results make sense.

You can try a larger batch size to check the performance of SmoothQuant.
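One way to run that comparison is a small timing sweep over batch sizes, executed once per engine. The sketch below is only an illustration: `sweep_batch_sizes` and the `run_batch` callable are hypothetical names, not part of TensorRT-LLM or Triton, and `run_batch` is assumed to run one full inference call for the given batch size and to block until the GPU work has finished.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def sweep_batch_sizes(
    run_batch: Callable[[int], None],
    batch_sizes: Iterable[int] = (1, 8, 32),
    warmup: int = 2,
    repeats: int = 5,
) -> Dict[int, Dict[str, float]]:
    """Time a hypothetical `run_batch(batch_size)` callable per batch size.

    Assumption: `run_batch` performs one complete inference call and only
    returns once the result is ready, so wall-clock timing is meaningful.
    """
    results: Dict[int, Dict[str, float]] = {}
    for bs in batch_sizes:
        # Warm-up runs are executed but not timed.
        for _ in range(warmup):
            run_batch(bs)

        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_batch(bs)
            timings.append(time.perf_counter() - start)

        median_s = statistics.median(timings)
        # Guard against a zero reading from a coarse timer.
        throughput = bs / median_s if median_s > 0 else float("inf")
        results[bs] = {"median_s": median_s, "requests_per_s": throughput}
    return results


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a GPU; replace it with
    # a call into the engine under test.
    def fake_run_batch(batch_size: int) -> None:
        time.sleep(0.001 * batch_size)

    for bs, stats in sweep_batch_sizes(fake_run_batch).items():
        print(f"batch={bs} median={stats['median_s']:.4f}s "
              f"throughput={stats['requests_per_s']:.1f} req/s")
```

Running the same sweep against the weight-only engine and the SmoothQuant engine, with identical input and output lengths, shows whether the gap changes as the batch size grows.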

shiqingzhangCSU commented 11 months ago

Thank you for your response! I will try more tests.