microsoft / BitBLAS

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
MIT License

How can I obtain the nearly 4x speedup of W4A16 matrix-vector computation? #41

Closed: ChenMnZ closed this issue 1 month ago

ChenMnZ commented 1 month ago

Hello,

I used https://github.com/microsoft/BitBLAS/blob/main/benchmark/operators/benchmark_bitblas_matmul.py to benchmark the operators on an A100-80GB GPU. The results are:

Matmul    1-16384-16384-float16-int4-float16-float16-nt-False-128-False-False-None  0.153 ms    
Matmul    1-16384-16384-float16-float16-float16-float16-nt-False-None-False-False-None  0.297 ms

It seems that W4A16 is only about 2x faster than W16A16, rather than the reported nearly 4x.

I wonder if I am missing something. Thank you!

LeiWang1999 commented 1 month ago

Could you provide your BitBLAS version? In the 0.0.1.dev4 release, we inadvertently disabled some optimizations. You might want to check the matmul.fast_decoding attribute; it is possible that it was set to False, which would cause the performance issue.
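
For example, something along these lines should reveal it (the config fields below are inferred from the benchmark name in your log, so treat this as a sketch and adjust as needed):

```python
import bitblas

# Rebuild the W4A16 operator from the benchmark log above
# (M=1, N=16384, K=16384, float16 activations, int4 weights, group_size=128).
config = bitblas.MatmulConfig(
    M=1,
    N=16384,
    K=16384,
    A_dtype="float16",
    W_dtype="int4",
    out_dtype="float16",
    accum_dtype="float16",
    layout="nt",
    with_bias=False,
    group_size=128,
)
matmul = bitblas.Matmul(config=config)

# If this prints False, the fast dequantization path was not enabled.
print(matmul.fast_decoding)
```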

ChenMnZ commented 1 month ago

My BitBLAS version is 0.0.1.dev5

ChenMnZ commented 1 month ago

I passed fast_decoding=True to Matmul, and the performance is as expected now:

Matmul    1-16384-16384-float16-int4-float16-float16-nt-False-128-False-False-None-int8-True  0.083 ms
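
For reference, this is roughly how I enable it; I pass the flag through MatmulConfig, and the other fields are taken from the benchmark name above, so treat it as a sketch:

```python
import bitblas

# Same W4A16 shape as the benchmark above, with the fast path forced on.
config = bitblas.MatmulConfig(
    M=1, N=16384, K=16384,
    A_dtype="float16", W_dtype="int4",
    out_dtype="float16", accum_dtype="float16",
    layout="nt", group_size=128,
    fast_decoding=True,  # explicitly enable the fast dequantization path
)
matmul = bitblas.Matmul(config=config)
```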

Thanks!

LeiWang1999 commented 1 month ago

That's strange, because the fast_decoding flag should be set to True by default. Thanks for your report!

LeiWang1999 commented 1 month ago

@ChenMnZ Thanks again for your report! There was indeed a typo that disabled fast_decoding by default. We just released 0.0.1.dev6, so specifying fast_decoding manually is no longer necessary.