This pull request refines a type conversion in the FP8-to-FP16 dequantization path and relaxes the tolerances in the corresponding test. The changes are aimed at improving the correctness of the conversion and keeping the test stable.
Here are the key changes:
Type conversion refinement:
python/bitblas/quantization/quantization.py: In the function _tir_u8_to_f8_e4m3_to_f16, the dtype of the shift operation has been changed from int16 to uint16, and the computation of s_f16 and e_f16 now uses bitwise operations.
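The actual function is written in TVM TIR, but the bit manipulation it performs can be sketched in plain Python. The sketch below is an assumption about the conversion's shape (normal values only; subnormals and NaN handling are omitted); it is not the repository's implementation. It shows why unsigned shifts matter: E4M3 has 1 sign / 4 exponent (bias 7) / 3 mantissa bits, FP16 has 1 / 5 (bias 15) / 10, so the sign, re-biased exponent, and widened mantissa are each moved into place with shifts and ORs.

```python
import struct


def u8_to_f8e4m3_to_f16_bits(val: int) -> int:
    """Reinterpret a uint8 holding FP8 E4M3 bits as FP16 bits.

    Illustrative sketch for normal values only; subnormal and NaN
    encodings are not handled here.
    """
    v = val & 0xFF                 # keep the value in unsigned range
    s_f16 = (v >> 7) << 15         # sign bit moved to the FP16 sign slot
    e_f8 = (v >> 3) & 0xF          # extract the 4-bit E4M3 exponent field
    e_f16 = (e_f8 + 8) << 10       # re-bias: 15 - 7 = 8, shift into place
    m_f16 = (v & 0x7) << 7         # widen the 3-bit mantissa to 10 bits
    return s_f16 | e_f16 | m_f16


def f16_bits_to_float(bits: int) -> float:
    """Decode FP16 bits to a Python float for checking."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]
```

In C-like integer arithmetic an arithmetic (signed) right shift would smear the sign bit into the extracted fields, which is why performing the shifts in uint16 rather than int16 is the safe choice.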
Precision adjustment:
testing/python/operators/test_general_matmul_fp8.py: In the function map_torch_type, the relative and absolute tolerances passed to torch.testing.assert_close have been increased from 1e-2 to 1e-1, relaxing the test's precision requirements to account for FP8 quantization error.