[Dev] Improve General Matmul With Splitk

This pull request changes the bitblas library, its documentation, and its tests. The most significant changes are enabling debug output in QuickStart.md, modifying the forward methods in python/bitblas/module/__init__.py and python/bitblas/ops/general_matmul_splitk.py, and adjusting the test scripts testing/python/operators/test_general_matmul_fp8.py and testing/python/operators/test_general_matmul_splitk_ops.py.

Debug output:

docs/QuickStart.md: Enabled debug output in three examples by calling bitblas.set_debug_level("Debug").

Codebase modifications:

python/bitblas/module/__init__.py: Modified the forward method to define a stream variable and a stream_handle variable; the stream_handle is passed to the lib.call method.
python/bitblas/ops/general_matmul_splitk.py: Adjusted the forward method to change the shape of the output tensor, create a new sk_output tensor, and populate the output tensor with torch.sum.
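
The kernel itself is not shown in this description, so the following is only a dependency-free sketch of the general split-K idea that the sk_output and torch.sum change suggests: the K dimension is cut into SplitK slices, each slice produces a partial M x N result, and the partials are summed into the final output. The function names matmul and split_k_matmul are illustrative and are not BitBLAS APIs. Storing the partials along a leading axis and requiring K to divide evenly by the split factor are assumptions made for the sketch, and plain Python lists stand in for torch tensors.

```python
def matmul(a, b):
    """Reference product of a (M x K) and b (K x N), both nested lists."""
    k_dim, n_dim = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(k_dim)) for j in range(n_dim)]
        for row in a
    ]


def split_k_matmul(a, b, split_k):
    """Split-K sketch: compute one partial product per K slice, then sum.

    Assumption for this sketch: K must be divisible by split_k.
    """
    k_dim = len(b)
    if k_dim % split_k != 0:
        raise ValueError("K must be divisible by split_k in this sketch")
    chunk = k_dim // split_k

    # sk_output plays the role of the intermediate tensor: one M x N
    # partial result per K slice, stored along a leading split axis.
    sk_output = []
    for s in range(split_k):
        lo, hi = s * chunk, (s + 1) * chunk
        a_slice = [row[lo:hi] for row in a]  # M x chunk
        b_slice = b[lo:hi]                   # chunk x N
        sk_output.append(matmul(a_slice, b_slice))

    # Reduce over the split axis; with tensors this step would be a sum
    # over dimension 0 of the stacked partial results.
    m_dim, n_dim = len(a), len(b[0])
    return [
        [sum(sk_output[s][i][j] for s in range(split_k)) for j in range(n_dim)]
        for i in range(m_dim)
    ]


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 1], [2, 0]]
result = split_k_matmul(a, b, split_k=2)
```

With integer inputs the split result matches the reference product exactly. With floating-point inputs the summation order differs, so a comparison against a reference result generally needs a tolerance rather than exact equality.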

Testing script adjustments:

testing/python/operators/test_general_matmul_fp8.py: Commented out the call to bitblas.testing.main() and added a call to test_matmul_torch_forward_weight_dequantize.
testing/python/operators/test_general_matmul_splitk_ops.py: Made several changes to the test methods, including adding a SplitK parameter, replacing the get_codegen_result method with a comparison of output_bitblas and output_torch, and adding a map_torch_type method to map input types to torch types.

microsoft/BitBLAS, pull request #50