[Dev] Fix a bug within FP8 E4M3 Fast Decoding

This pull request fixes a bug in FP8 E4M3 fast decoding in the bitblas Python package and bumps the version number. The main changes are: importing MatmulConfigWithSplitK and MatmulWithSplitK into the bitblas module, updating the gemv and gemv_dequantize modules to accept a block_info.iters length of 4, and revising the FP8 E4M3 to float16 conversion in the quantization module. The version number is updated from 0.0.1.dev9 to 0.0.1.dev12.

Version Update:

VERSION and python/bitblas/__init__.py: Updated the version number from 0.0.1.dev9 to 0.0.1.dev12.

Enhancements to bitblas module:

python/bitblas/__init__.py: Imported MatmulConfigWithSplitK and MatmulWithSplitK from general_matmul_splitk module.

Updates to gemv and gemv_dequantize modules:

python/bitblas/gpu/gemv.py: Extended the acceptable range of block_info.iters length to include 4.
python/bitblas/gpu/gemv_dequantize.py: Adjusted the logic in get_vectorize_factor to handle cases where the length of sch.get_loops(block_b) is 4.

Modifications to quantization module:

python/bitblas/quantization/quantization.py: Revised the _tir_u8_to_f8_e4m3_to_f16 function and added a new function, _tir_u8_to_f8_e4m3_to_f16_naive, for decoding FP8 E4M3 values stored as uint8 into float16.
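
For readers unfamiliar with the format, the following is a minimal, standalone sketch of what decoding an FP8 E4M3 byte involves. It is not the BitBLAS TIR code: the function names decode_e4m3_reference and decode_e4m3_via_f16_bits are hypothetical, and the sketch assumes the common E4M3 variant with exponent bias 7, no infinities, and NaN encoded as S.1111.111. The first function decodes field by field; the second builds float16 bits by shifting the 7 magnitude bits into place and adding the bias difference (15 - 7 = 8, that is 0x2000 in the float16 exponent field), with subnormals handled separately because the shift trick does not cover them.

```python
import math
import struct


def decode_e4m3_reference(byte: int) -> float:
    """Field-by-field decode of one FP8 E4M3 byte (assumed bias 7, no infinities)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan  # the only NaN pattern in this E4M3 variant
    if exp == 0:
        # Zero and subnormals: no implicit leading one, fixed exponent of -6.
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def decode_e4m3_via_f16_bits(byte: int) -> float:
    """Hypothetical shift-based decode that assembles IEEE float16 bits directly."""
    sign_bits = (byte & 0x80) << 8  # move the sign to bit 15
    magnitude = byte & 0x7F
    if magnitude == 0x7F:
        return math.nan
    if magnitude < 0x08:
        # Exponent field is 0: value is magnitude * 2**-9, outside the shift trick.
        f16_bits = struct.unpack("<H", struct.pack("<e", magnitude * 2.0 ** -9))[0]
    else:
        # Align the 4+3 exponent/mantissa bits with float16's 5+10 layout,
        # then rebias the exponent by 8 (8 << 10 == 0x2000).
        f16_bits = (magnitude << 7) + 0x2000
    return struct.unpack("<e", struct.pack("<H", sign_bits | f16_bits))[0]
```

Comparing the two functions over all 256 byte values is a cheap way to check a fast path against a naive one, which is presumably the role of a `_naive` variant in tests.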

Other Changes:

python/bitblas/wrapper/general.py: Modified the legalize_c function to handle cases where dynamic_symbolic_set is not empty.
testing/python/operators/test_general_matmul_splitk_ops.py: Added further calls to matmul.forward to the tests.

microsoft / BitBLAS

[Dev] Fix a bug within FP8 E4M3 Fast Decoding #54