klxy0304 opened this issue 1 week ago
@klxy0304, BitBLAS uses a straightforward rule to determine whether a GEMM shape should utilize the tensor core, as seen here: [matmul_analysis.py#L669-L670](https://github1s.com/microsoft/BitBLAS/blob/main/bitblas/gpu/matmul_analysis.py#L669-L670).
The rule requires each dimension to be larger than 16 (in your case, the dimension is 8). However, you can still enable it by running:
```python
tensorized_func, tags = get_tensorized_func_and_tags(func, arch.target, allow_gemv=True)
```
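For context, the shape rule described above can be sketched as follows. This is a minimal illustration, not BitBLAS's actual implementation; the function name, constant, and the `allow_gemv` handling here are assumptions for illustration only.

```python
# Minimal sketch of the tensor-core eligibility rule described above.
# Not BitBLAS's actual code; names are illustrative only.
MIN_TC_DIM = 16  # tensor core MMA tiles require each GEMM dim to reach 16

def can_use_tensor_core(m: int, n: int, k: int, allow_gemv: bool = False) -> bool:
    if allow_gemv and m < MIN_TC_DIM:
        # GEMV-like shapes: tolerate a small M as long as N and K qualify.
        return n >= MIN_TC_DIM and k >= MIN_TC_DIM
    # Default rule: every dimension must meet the minimum tile size.
    return all(dim >= MIN_TC_DIM for dim in (m, n, k))

print(can_use_tensor_core(8, 3584, 152064))                   # False: M=8 fails the check
print(can_use_tensor_core(8, 3584, 152064, allow_gemv=True))  # True: small M is waived
```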
@LeiWang1999 Thank you for the quick reply. When I enabled tensor core utilization, I received an error from TVM, as below:
```
Traceback (most recent call last):
  File "/root/workspace/tuning_work/bitblas/bitblas_tuning.py", line 47, in <module>
    cpresults, best = apply_and_build(func, configs, arch, parallel_build=True)
  File "/usr/local/lib/python3.10/dist-packages/bitblas/base/utils.py", line 293, in apply_and_build
    return apply_and_build_parallel(
  File "/usr/local/lib/python3.10/dist-packages/bitblas/base/utils.py", line 201, in apply_and_build_parallel
    builder = PopenPoolExecutor(max_workers=max_workers, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 360, in __init__
    self._threadpool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 144, in __init__
    raise ValueError("max_workers must be greater than 0")
ValueError: max_workers must be greater than 0
Exception ignored in: <function PopenPoolExecutor.__del__ at 0x7fd03871a050>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 374, in __del__
    self._lock.acquire()
AttributeError: 'PopenPoolExecutor' object has no attribute '_lock'
```
Could you tell me how to solve this?
Looks like it's an environment-related issue; maybe you could try disabling parallel_build.
@LeiWang1999, I tried setting parallel_build=False, and to rule out the original environment problem, I started a new Docker container and reinstalled via `pip install bitblas`. But this error still occurs.
@klxy0304, would you mind appending `bitblas.set_log_level("Debug")` before `M=8`?
@LeiWang1999 Sure. After I appended it, the log is:
```
Traceback (most recent call last):
  File "/ossfs/workspace/bitblas_tune.py", line 48, in <module>
    cpresults, best = apply_and_build(func, configs, arch, parallel_build=False)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/base/utils.py", line 293, in apply_and_build
    return apply_and_build_parallel(
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/base/utils.py", line 201, in apply_and_build_parallel
    builder = PopenPoolExecutor(max_workers=max_workers, timeout=timeout)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 360, in __init__
    self._threadpool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/concurrent/futures/thread.py", line 144, in __init__
    raise ValueError("max_workers must be greater than 0")
ValueError: max_workers must be greater than 0
Exception ignored in: <function PopenPoolExecutor.__del__ at 0x7fd1d8a96020>
Traceback (most recent call last):
  File "/opt/conda/envs/tvm-build-venv/lib/python3.11/site-packages/bitblas/3rdparty/tvm/python/tvm/contrib/popen_pool.py", line 374, in __del__
    self._lock.acquire()
    ^^^^^^^^^^
AttributeError: 'PopenPoolExecutor' object has no attribute '_lock'
```
@LeiWang1999 I found that the reason is that the check_tile_shape_isvalid judgment in the emit_config interface keeps failing, resulting in max_workers=0. As seen here: [default.py#L46-L47](https://github1s.com/microsoft/BitBLAS/blob/main/bitblas/base/roller/policy/default.py#L46-L47). Is this caused by the definition of MatmulNT?
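For reference, this failure mode can be reproduced without BitBLAS; the sketch below assumes, as described above, that an empty surviving config list drives max_workers to zero.

```python
import concurrent.futures

# If every candidate tile config is rejected, the surviving config list
# is empty and max_workers becomes 0, which ThreadPoolExecutor rejects.
configs = []                 # hypothetical: nothing passed the validity check
max_workers = len(configs)

try:
    concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
except ValueError as err:
    print(err)               # max_workers must be greater than 0
```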
@klxy0304, I tested on my A100, and the issue seems to be that the value of N is too large, which may cause N * K to overflow the maximum INT32 value.
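A quick way to check for this kind of overflow is sketched below. This is a generic helper, not BitBLAS code, and the shapes used are hypothetical illustrations rather than the exact GEMM in question.

```python
INT32_MAX = 2**31 - 1  # 2147483647

def exceeds_int32(*dims: int) -> bool:
    """Return True if the product of dims cannot fit in a signed 32-bit int."""
    prod = 1
    for d in dims:
        prod *= d
    return prod > INT32_MAX

# Hypothetical shapes for illustration:
print(exceeds_int32(65536, 65536))  # True: 2**32 elements overflows int32
print(exceeds_int32(1024, 1024))    # False: fits comfortably
```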
Hello, I tried to run fast tuning of a GEMM with float16:
But I got results that are not as expected:

```
[BitBLAS] The best latency of top 1 is 11.767 ms
[BitBLAS] The best latency of top 20 is 5.987 ms
```
For comparison, I tuned a single-layer model using TVM's Meta Schedule; the model is `nn.Linear(3584, 152064)` with a batch size of 8. Below are the tuning log results:
| ID | Name | FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done |
|----|------|------|--------|----------------|--------------|-----------------------|--------|------|
| 0 | fused_nn_dense_add | 8721174528 | 1 | 13285.4769 | 656.4442 | 656.4442 | 1535 | |
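As a sanity check, the Speed column of the log above is consistent with the FLOP and Latency columns (GFLOPS = FLOP / latency):

```python
# Verify the logged Speed column from the FLOP and Latency columns.
flop = 8721174528        # fused_nn_dense_add for M=8, N=152064, K=3584
latency_us = 656.4442
gflops = flop / (latency_us * 1e-6) / 1e9
print(round(gflops, 1))  # ~13285.5, consistent with the logged 13285.4769
```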
The result is 656 us. I would like to know whether I am using the BitBLAS tuning method incorrectly.