microsoft / BitBLAS

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
MIT License

Is fp8 quantization gemm supported? #82

Closed. sleepwalker2017 closed this issue 1 week ago.

sleepwalker2017 commented 2 weeks ago

Both input A and B are in fp8, and the output is fp16.

Or a fused one: input A in fp16 with an fp32 scale for A, and B in fp8; the kernel quantizes A to fp8 and then invokes the fp8 GEMM to produce fp16 output.

Are these supported? If so, are there any benchmarks? Thank you!
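
For reference, the second (fused) request is equivalent to the unfused sequence below. This is only a PyTorch illustration of the requested semantics, not BitBLAS code; the per-tensor scaling scheme, shapes, and scale value are assumptions.

```python
# Illustration of the requested semantics only (unfused, plain PyTorch).
# The per-tensor scaling scheme and all values are assumptions.
import torch

def quantize_to_fp8_e4m3(a_fp16: torch.Tensor, a_scale: torch.Tensor) -> torch.Tensor:
    # Scale fp16 activations into fp8 range, then cast to fp8.
    return (a_fp16 / a_scale).to(torch.float8_e4m3fn)

A_fp16 = torch.randn(16, 4096, device="cuda", dtype=torch.float16)
a_scale = torch.tensor(0.05, device="cuda", dtype=torch.float32)  # hypothetical scale
B_fp8 = torch.randn(4096, 4096, device="cuda", dtype=torch.float16).to(torch.float8_e4m3fn)

A_fp8 = quantize_to_fp8_e4m3(A_fp16, a_scale)

# A fused kernel would do the quantization and the fp8 GEMM in one pass;
# here the GEMM is emulated in fp16 purely for illustration.
C_fp16 = (A_fp8.to(torch.float16) @ B_fp8.to(torch.float16).T) * a_scale
```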

LeiWang1999 commented 2 weeks ago

@sleepwalker2017 thanks for your attention! Currently, we do not support FP8 GEMM with scaling. Since FP8 quantization typically has no zero point, rescaling can instead be performed as a separate external kernel that adjusts the output. If you wish to perform FP8 GEMM, please refer to https://github.com/microsoft/BitBLAS/blob/main/testing/python/operators/test_general_matmul_fp8.py.

You can also apply scaling to the input by directly editing https://github.com/microsoft/BitBLAS/blob/main/bitblas/ops/impl/matmul_dequantize_impl.py.
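
A minimal sketch of what such an FP8 GEMM call might look like, based on the referenced test file: the dtype strings (e.g. "e4m3_float8"), config fields, and shapes are assumptions drawn from that test, so verify them against your BitBLAS version. The per-tensor rescaling is applied as a separate step on the output, as described above.

```python
# Sketch only: dtype strings and config fields are assumed to match
# testing/python/operators/test_general_matmul_fp8.py.
import torch
import bitblas

M, N, K = 16, 4096, 4096

config = bitblas.MatmulConfig(
    M=M, N=N, K=K,
    A_dtype="e4m3_float8",   # fp8 activations (assumed dtype string)
    W_dtype="e4m3_float8",   # fp8 weights (assumed dtype string)
    accum_dtype="float32",
    out_dtype="float16",
    layout="nt",             # A row-major, B stored transposed
)
matmul = bitblas.Matmul(config=config)

A = torch.randn(M, K, device="cuda", dtype=torch.float16).to(torch.float8_e4m3fn)
B = torch.randn(N, K, device="cuda", dtype=torch.float16).to(torch.float8_e4m3fn)

C = matmul(A, B)             # fp16 output

# Rescaling as an external step, as suggested above: fold the per-tensor
# scales of A and B back into the fp16 output.
a_scale, b_scale = 0.01, 0.02   # hypothetical per-tensor scales
C = C * (a_scale * b_scale)
```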

sleepwalker2017 commented 2 weeks ago

Thank you for the quick reply! I'll try that.