Closed by sleepwalker2017 1 week ago
@sleepwalker2017 thanks for your interest! Currently we do not support FP8 GEMM with scaling. Since FP8 GEMM typically has no zero point, rescaling can instead be performed by a separate external kernel that adjusts the output. If you wish to run FP8 GEMM, please refer to https://github.com/microsoft/BitBLAS/blob/main/testing/python/operators/test_general_matmul_fp8.py.
You can also apply scaling to the input by directly editing https://github.com/microsoft/BitBLAS/blob/main/bitblas/ops/impl/matmul_dequantize_impl.py
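To make the "external rescaling" suggestion concrete, here is a minimal numpy sketch (not BitBLAS code): both operands are quantized to the fp8-e4m3 value range with per-tensor scales, an unscaled GEMM accumulates in float32, and a separate rescaling step folds the two scales back into the fp16 output. The shapes and the `quantize_e4m3` helper are hypothetical, and the fp8 cast is only emulated.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in fp8 e4m3

def quantize_e4m3(x):
    """Per-tensor symmetric quantization into the fp8-e4m3 range.

    numpy emulation for illustration only; a real kernel would cast
    to a hardware fp8 type instead of keeping float32 storage.
    """
    scale = np.float32(np.abs(x).max() / E4M3_MAX)
    q = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)
    return q.astype(np.float32), scale

# Hypothetical shapes for the sketch.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8)).astype(np.float32)
B = rng.standard_normal((8, 16)).astype(np.float32)

qA, sA = quantize_e4m3(A)
qB, sB = quantize_e4m3(B)

# Unscaled "FP8" GEMM with float32 accumulation...
acc = qA @ qB
# ...followed by the external rescaling kernel described above.
out = (acc * sA * sB).astype(np.float16)
```

The key point is that the GEMM itself never sees the scales; because there is no zero point, the two per-tensor scales factor out of the accumulation and can be applied in one elementwise pass afterwards.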
Thank you for the quick reply! I'll try that.
Both inputs A and B are in fp8, and the output is fp16.
Or a fused one: input A in fp16 with an A scale in float32, B in fp8; the kernel quantizes A to fp8 and then invokes the fp8 GEMM to produce an fp16 output.
Are these supported? If so, are there any benchmarks? Thank you!
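For reference, the fused variant being asked about can be sketched in numpy as follows. The function name, its signature, and the fp8 emulation are all hypothetical; a real implementation would fuse the quantization into the GEMM prologue rather than materializing `qA`.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite fp8-e4m3 value

def fused_quant_fp8_gemm(A_fp16, A_scale, B_fp8, B_scale):
    """Hypothetical fused kernel: quantize fp16 A with a float32 scale,
    run an (emulated) fp8 x fp8 GEMM, and return an fp16 output."""
    qA = np.clip(A_fp16.astype(np.float32) / A_scale, -E4M3_MAX, E4M3_MAX)
    acc = qA @ B_fp8.astype(np.float32)  # fp8 x fp8 -> float32 accumulate
    return (acc * A_scale * B_scale).astype(np.float16)

# Usage with hypothetical shapes: B is pre-quantized offline, A on the fly.
rng = np.random.default_rng(1)
A16 = rng.standard_normal((4, 8)).astype(np.float16)
Bf = rng.standard_normal((8, 16)).astype(np.float32)
B_scale = np.float32(np.abs(Bf).max() / E4M3_MAX)
B_fp8 = np.clip(Bf / B_scale, -E4M3_MAX, E4M3_MAX).astype(np.float32)
A_scale = np.float32(np.abs(A16).max() / E4M3_MAX)
out = fused_quant_fp8_gemm(A16, A_scale, B_fp8, B_scale)
```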