[Feature] Enhancing MatmulOps with Splitk Support

This pull request makes several changes across the python/bitblas package to add split-K support to the BitBLAS matmul operators. The changes include updates to the Rasterization and TensorCoreExtraConfig classes, modifications to the fast_decode_impl method, and the addition of the MatmulWithSplitK class.

Updates to Rasterization and TensorCoreExtraConfig classes:

python/bitblas/base/roller/__init__.py: Imported new Rasterization classes.
python/bitblas/base/roller/hint.py: Added a new method tensorcore_legalization to the TensorCoreExtraConfig class.

Modifications to fast_decode_impl method:

python/bitblas/gpu/intrin/lop3.py: Reformatted the arguments in the get_fast_decode_intrin calls within the fast_decode_impl method for better readability.

Addition of MatmulWithSplitK class:

python/bitblas/ops/general_matmul_splitk.py: Added a new file implementing the MatmulWithSplitK class, which extends the functionality of the Matmul class with the ability to split the K dimension.
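
For readers unfamiliar with the technique, the idea of splitting the K dimension can be sketched in plain Python. This is only an illustration of the general split-K scheme, not the code in general_matmul_splitk.py: the function names matmul and matmul_splitk and the splitk_factor parameter are hypothetical, and the sketch assumes K is evenly divisible by the split factor.

```python
def matmul(a, b):
    """Reference product of an M x K matrix and a K x N matrix (lists of lists)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def matmul_splitk(a, b, splitk_factor):
    """Hypothetical split-K product: partition K, multiply each slice, then reduce."""
    m, k, n = len(a), len(b), len(b[0])
    if k % splitk_factor != 0:
        # Assumption of this sketch only: K divides evenly into the slices.
        raise ValueError("K must be divisible by splitk_factor in this sketch")
    chunk = k // splitk_factor

    # Each slice of K yields an independent M x N partial product.
    # On a GPU these partial products could be computed in parallel.
    partials = []
    for s in range(splitk_factor):
        lo, hi = s * chunk, (s + 1) * chunk
        a_slice = [row[lo:hi] for row in a]  # M x chunk
        b_slice = b[lo:hi]                   # chunk x N
        partials.append(matmul(a_slice, b_slice))

    # Reduction step: sum the partial products elementwise.
    return [[sum(p[i][j] for p in partials) for j in range(n)]
            for i in range(m)]
```

Splitting K tends to help when M and N are small and K is large, since the extra slices expose more independent work at the cost of a final reduction.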

Other important changes:

3rdparty/tvm: Updated the subproject commit.
python/bitblas/base/roller/policy/tensorcore.py: Added a call to tensorcore_legalization in the _score method.
python/bitblas/module/__init__.py: Changed the default value of fast_decoding from True to None in the __init__ method.
python/bitblas/ops/general_matmul.py: Removed the OPExecutorCPU class and added a condition in the __initialize_fast_decoding method that checks whether fast decoding is supported.

microsoft/BitBLAS, pull request #48