[DEV][TL] Support AMD Matrix Code Implementation

This pull request restructures the initialization process of the bitblas package and extends its layout functions. The main changes are an updated submodule reference, a refactored initialization script, and new layout functions.

Submodule Update:

Updated the tvm submodule (3rdparty/tvm) to a new commit.

Initialization Refactor:

Refactored bitblas/__init__.py to streamline environment variable setup and module imports, removing redundant code and reorganizing the import statements.
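
For reviewers unfamiliar with this kind of setup, here is a minimal sketch of idempotent environment setup in a package's `__init__.py`. It is not the code in this PR: the helper names `prepend_to_env_path` and `register_bundled_package` are hypothetical, and the directory passed in is a placeholder.

```python
import os
import sys


def prepend_to_env_path(var_name: str, directory: str) -> None:
    """Prepend `directory` to an os.pathsep-separated environment variable.

    Hypothetical helper: does nothing if the directory is already listed,
    so importing the package twice does not grow the variable.
    """
    current = os.environ.get(var_name, "")
    entries = [entry for entry in current.split(os.pathsep) if entry]
    if directory not in entries:
        os.environ[var_name] = os.pathsep.join([directory] + entries)


def register_bundled_package(directory: str) -> None:
    """Make a bundled dependency importable in this process and in children.

    PYTHONPATH covers subprocesses; sys.path covers the current interpreter.
    """
    prepend_to_env_path("PYTHONPATH", directory)
    if directory not in sys.path:
        sys.path.insert(0, directory)
```

Keeping the setup in small idempotent helpers is one way to avoid the repeated blocks that a refactor like this removes.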

Logging Improvements:

Improved the logging setup by adjusting the formatter and ensuring consistent string formatting. (bitblas/__init__.py)
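
As an illustration of what a formatter adjustment looks like, the sketch below configures a package logger with the standard `logging` module. The function name `setup_logger`, the logger name, and the format string are assumptions for this example and are not copied from bitblas/__init__.py.

```python
import logging


def setup_logger(name: str = "example_pkg", level: int = logging.WARNING) -> logging.Logger:
    """Return a logger with a single stream handler and a fixed format.

    Hypothetical sketch: the format string below is an assumption,
    not the one used by the package.
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Attach the handler only once so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler()
        formatter = logging.Formatter(
            fmt="%(asctime)s [%(name)s:%(levelname)s]: %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    # Avoid double printing through the root logger.
    logger.propagate = False
    return logger
```

Calling `setup_logger("example_pkg").warning("message")` would then emit one line per call in the configured format.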

Import Path Updates:

Updated import paths in several files to reflect the new location of the mma_macro_generator module.
- bitblas/ops/general_matmul/tilelang/dense/matmul_tensorcore.py
- bitblas/ops/general_matmul/tilelang/dense/matmul_tensorcore_s4.py
- bitblas/ops/general_matmul/tilelang/dequantize/finegrained_primitive_tensorcore.py
- bitblas/ops/general_matmul/tilelang/dequantize/finegrained_primitive_tensorcore_s4.py
- bitblas/ops/general_matmul/tilelang/dequantize/ladder_weight_transform_tensorcore.py
- bitblas/ops/general_matmul/tilelang/dequantize/ladder_weight_transform_tensorcore_s4.py
- bitblas/tl/__init__.py

New Layout Functions:

Added new layout functions for shared-to-local memory mapping, which facilitate efficient memory access patterns for tensor operations, in the following files:
- bitblas/tl/base_layout.py
- bitblas/tl/mfma_layout.py
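
To make the idea of a shared-to-local mapping concrete, here is a minimal sketch in plain Python. It is an assumption-laden illustration, not the functions added in this PR: the names `shared_to_local` and `local_to_shared` are hypothetical, and the shape (a 16x16 tile spread over 64 lanes holding 4 elements each) is chosen only for the example.

```python
# Hypothetical tile shape: a 16x16 tile distributed over 64 lanes,
# each lane holding 4 consecutive elements of one row.
ROWS, COLS = 16, 16
LANES, ELEMS_PER_LANE = 64, 4


def shared_to_local(i: int, j: int) -> tuple[int, int]:
    """Map tile coordinate (i, j) to (thread_id, local_id)."""
    thread_id = i + ROWS * (j // ELEMS_PER_LANE)
    local_id = j % ELEMS_PER_LANE
    return thread_id, local_id


def local_to_shared(thread_id: int, local_id: int) -> tuple[int, int]:
    """Inverse mapping: (thread_id, local_id) back to tile coordinate (i, j)."""
    i = thread_id % ROWS
    j = (thread_id // ROWS) * ELEMS_PER_LANE + local_id
    return i, j


if __name__ == "__main__":
    # Every tile element should be owned by exactly one (lane, slot) pair.
    owners = {shared_to_local(i, j) for i in range(ROWS) for j in range(COLS)}
    print(len(owners) == LANES * ELEMS_PER_LANE)
```

A useful property to check for any such pair of functions is that they are inverses of each other, so each element of the tile is loaded by exactly one lane into exactly one register slot.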

microsoft / BitBLAS

[DEV][TL] Support AMD Matrix Code Implementation #237
