microsoft / BitBLAS

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.

performance of float16 with fast tuning #173

Open klxy0304 opened 1 week ago

klxy0304 commented 1 week ago

Hello, I tried to run a fast tuning of GEMM with float16:

from bitblas.base.roller.policy import TensorCorePolicy, DefaultPolicy
from bitblas.base.arch import CUDA
from bitblas.base.utils import apply_and_build
# get_tensorized_func_and_tags is used below; it lives in bitblas.gpu.matmul_analysis
from bitblas.gpu.matmul_analysis import get_tensorized_func_and_tags
import tvm
from tvm.script import tir as T

M = 8
N = 152064
K = 3584
@tvm.script.ir_module
class MatmulNT:
    @T.prim_func
    def main(a: T.handle, b: T.handle, c: T.handle):
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        A = T.match_buffer(a, [M, K], dtype="float16")
        B = T.match_buffer(b, [N, K], dtype="float16")
        C = T.match_buffer(c, [M, N], dtype="float16")

        for i, j, k in T.grid(M, N, K):
            with T.block("B"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = tvm.tir.const(0, "float16")
                C[vi, vj] = C[vi, vj] + A[vi, vk].astype("float16") * B[vj, vk].astype("float16")

ir_module = MatmulNT
func = ir_module["main"]
target = tvm.target.Target("nvidia/nvidia-a100")
arch = CUDA(target)
# Tune with SIMT Cuda Core
policy = DefaultPolicy(func=func, arch=arch)
try:
    tensorized_func, tags = get_tensorized_func_and_tags(func, arch.target)
except Exception:
    tags = None
# Tune with Tensor Core if possible
if tags:
    policy = TensorCorePolicy(func=tensorized_func, arch=arch, tags=tags)

configs = policy.emit_config(topk=20)

cpresults, best = apply_and_build(func, configs, arch, parallel_build=True)

print("[BitBLAS] The best latency of top 1 is {:.3f} ms".format(cpresults[0].latency))
print("[BitBLAS] The best latency of top 20 is {:.3f} ms".format(best.latency))

But I got results that were not as expected:

[BitBLAS] The best latency of top 1 is 11.767 ms
[BitBLAS] The best latency of top 20 is 5.987 ms

For comparison, I tuned a single-layer model (nn.Linear(3584, 152064)) with a batch size of 8 using TVM's Meta Schedule. Below are the tuning log results:

| ID | Name | FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done |
|----|------|------|--------|----------------|--------------|-----------------------|--------|------|
| 0 | fused_nn_dense_add | 8721174528 | 1 | 13285.4769 | 656.4442 | 656.4442 | 1535 | |

The Meta Schedule result is 656 us, so I would like to know whether I am using the BitBLAS tuning method incorrectly.

LeiWang1999 commented 1 week ago

@klxy0304, BitBLAS uses a straightforward rule to determine whether a GEMM shape should utilize the tensor core, as seen here: [matmul_analysis.py#L669-L670](https://github1s.com/microsoft/BitBLAS/blob/main/bitblas/gpu/matmul_analysis.py#L669-L670).

The rule requires each dimension to be larger than 16 (in your case, M is 8). However, you can still enable tensor cores by running:

tensorized_func, tags = get_tensorized_func_and_tags(func, arch.target, allow_gemv=True)
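
For reference, this replaces the call inside the try block of your script; a sketch using the same import as above:

try:
    # allow_gemv=True lets shapes with a small dimension (here M = 8)
    # still take the tensor core path
    tensorized_func, tags = get_tensorized_func_and_tags(func, arch.target, allow_gemv=True)
except Exception:
    tags = None
if tags:
    policy = TensorCorePolicy(func=tensorized_func, arch=arch, tags=tags)
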
klxy0304 commented 1 week ago

@LeiWang1999 Thank you for the quick reply. When I enabled tensor core utilization, I received an error from TVM as below:

Traceback (most recent call last):
  File "/root/workspace/tuning_work/bitblas/bitblas_tuning.py", line 47, in <module>
    cpresults, best = apply_and_build(func, configs, arch, parallel_build=True)
  File "/usr/local/lib/python3.10/dist-packages/bitblas/base/utils.py", line 293, in apply_and_build
    return apply_and_build_parallel(
  File "/usr/local/lib/python3.10/dist-packages/bitblas/base/utils.py", line 201, in apply_and_build_parallel
    builder = PopenPoolExecutor(max_workers=max_workers, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 360, in __init__
    self._threadpool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 144, in __init__
    raise ValueError("max_workers must be greater than 0")
ValueError: max_workers must be greater than 0
Exception ignored in: <function PopenPoolExecutor.__del__ at 0x7fd03871a050>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 374, in __del__
    self._lock.acquire()
AttributeError: 'PopenPoolExecutor' object has no attribute '_lock'

Could you tell me how to solve this?

LeiWang1999 commented 1 week ago

Looks like an environment-related issue; maybe you could try disabling parallel_build.
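
That is, in the script:

# build and profile the candidate configs sequentially instead of in a process pool
cpresults, best = apply_and_build(func, configs, arch, parallel_build=False)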

klxy0304 commented 1 week ago

@LeiWang1999, I tried setting parallel_build=False, and to rule out the original environment problem, I started a new Docker container and reinstalled via "pip install bitblas", but this error still occurs.

LeiWang1999 commented 1 week ago

@klxy0304, would you mind appending bitblas.set_log_level("Debug") before M = 8?
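
That is, near the top of the script, before the shape definitions:

import bitblas
bitblas.set_log_level("Debug")  # verbose output from the tuning pipeline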

klxy0304 commented 1 week ago

@LeiWang1999 Sure. After I appended it, the log is:

Traceback (most recent call last):
  File "/ossfs/workspace/bitblas_tune.py", line 48, in <module>
    cpresults, best = apply_and_build(func, configs, arch, parallel_build=False)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/base/utils.py", line 293, in apply_and_build
    return apply_and_build_parallel(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/base/utils.py", line 201, in apply_and_build_parallel
    builder = PopenPoolExecutor(max_workers=max_workers, timeout=timeout)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 360, in __init__
    self._threadpool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/concurrent/futures/thread.py", line 144, in __init__
    raise ValueError("max_workers must be greater than 0")
ValueError: max_workers must be greater than 0
Exception ignored in: <function PopenPoolExecutor.__del__ at 0x7fd1d8a96020>
Traceback (most recent call last):
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 374, in __del__
    self._lock.acquire()
    ^^^^^^^^^^
AttributeError: 'PopenPoolExecutor' object has no attribute '_lock'

klxy0304 commented 1 week ago

@LeiWang1999 I found the cause: the check_tile_shape_isvalid check inside the emit_config interface fails for every candidate, which results in max_workers=0. As seen here: [default.py#L46-L47](https://github1s.com/microsoft/BitBLAS/blob/main/bitblas/base/roller/policy/default.py#L46-L47). Is this caused by the definition of MatmulNT?
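
A minimal guard in the driver script (hypothetical, not a BitBLAS API) would surface this failure mode earlier:

configs = policy.emit_config(topk=20)
# if every tile shape is rejected, emit_config returns an empty list and the
# downstream PopenPoolExecutor ends up constructed with max_workers=0
assert len(configs) > 0, "emit_config produced no valid tile configurations"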

LeiWang1999 commented 1 week ago

@klxy0304, I tested on my A100, and the issue seems to be that the value of N is too large, which may cause the product N * K to overflow the maximum INT32 value.
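
For reference, a quick check of the products involved (plain Python, independent of BitBLAS):

INT32_MAX = 2**31 - 1  # 2147483647
M, N, K = 8, 152064, 3584
print(N * K)      # 544997376  -- fits in int32
print(M * N * K)  # 4359979008 -- exceeds INT32_MAX

N * K alone still fits in int32, while the full M * N * K product does not, so if an int32 overflow is the culprit it would involve a product over all three dimensions.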