Hi, could you insert the debug call bitblas.set_log_level("Debug") on the third line of your script, right after import torch, and print the log? Thank you.
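For reference, a minimal sketch of the requested change (assuming the test script begins by importing torch and bitblas):

```python
import torch
import bitblas

# Turn on verbose BitBLAS logging so tuning and codegen steps are printed.
bitblas.set_log_level("Debug")
```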
It works on my A100:
python test_issue_131.py
Ref output: tensor([[1654., 1572., 1550., ..., 1519., 1561., 1584.]], device='cuda:0',
dtype=torch.float16)
BitBLAS output: tensor([[1654., 1572., 1550., ..., 1520., 1561., 1585.]], device='cuda:0',
dtype=torch.float16)
@xysmlx @LeiWang1999 I tested both an A800 GPU and a Titan Xp GPU. Below are the logs for both GPUs.
error on A800:
root@train-xxxx-5-0:/data1/speed_test# python new_bitblas_test.py
TVM target not found. Please set the TVM target environment variable using `export TVM_TARGET=<target>`, where <target> is one of the available targets can be found in the output of `tools/get_available_targets.py`.
2024-08-05 20:09:25 [BitBLAS:WARNING]: TVM target not found. Please set the TVM target environment variable using `export TVM_TARGET=<target>`, where <target> is one of the available targets can be found in the output of `tools/get_available_targets.py`.
2024-08-05 20:09:25 [BitBLAS:INFO]: Auto detected target: cuda
2024-08-05 20:09:26 [BitBLAS:DEBUG]: Cannot find the appropriate index map for tensorcore
2024-08-05 20:09:51 [BitBLAS:DEBUG]: Cannot find the appropriate index map for tensorcore
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Apply config {'block': [1], 'thread': [1], 'rstep': [1024], 'reduce_thread': [128], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Apply config {'block': [2], 'thread': [2], 'rstep': [1024], 'reduce_thread': [64], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Apply config {'block': [8], 'thread': [8], 'rstep': [1024], 'reduce_thread': [16], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Apply config {'block': [4], 'thread': [4], 'rstep': [1024], 'reduce_thread': [32], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Apply config {'block': [16], 'thread': [16], 'rstep': [512], 'reduce_thread': [8], 'vectorize': {'A': 4, 'B_decode': 8}}
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Apply config {'block': [32], 'thread': [32], 'rstep': [256], 'reduce_thread': [4], 'vectorize': {'A': 2, 'B_decode': 8}}
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Apply config {'block': [64], 'thread': [64], 'rstep': [128], 'reduce_thread': [2], 'vectorize': {'B_decode': 8}}
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Apply config {'block': [128], 'thread': [128], 'rstep': [128], 'vectorize': {'B_decode': 8}}
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Warning: block config [128] is not valid for matmul, skip.
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Warning: block config [128] is not valid for matmul, skip.
WARNING:bitblas.utils.target_detector:TVM target not found. Please set the TVM target environment variable using `export TVM_TARGET=<target>`, where <target> is one of the available targets can be found in the output of `tools/get_available_targets.py`.
(the warning above is repeated 8 times in the log)
2024-08-05 20:09:57 [BitBLAS:DEBUG]: LocalBuilder: An exception occurred Traceback (most recent call last):
File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/exec/popen_worker.py", line 87, in main
result = fn(*args, **kwargs)
File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/base/utils.py", line 213, in _build
rt_mod = tvm.build(mod, target=arch.target)
File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/driver/build_module.py", line 297, in build
rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
raise_last_ffi_error()
File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
ValueError: Traceback (most recent call last):
68: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)>::AssignTypedLambda<tvm::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#6}>(tvm::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#6}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMRetValue)
67: tvm::TIRToRuntime(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target const&)
66: tvm::codegen::Build(tvm::IRModule, tvm::Target)
65: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::IRModule, tvm::Target)>::AssignTypedLambda<tvm::runtime::Module (*)(tvm::IRModule, tvm::Target)>(tvm::runtime::Module (*)(tvm::IRModule, tvm::Target), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
64: tvm::codegen::BuildCUDA(tvm::IRModule, tvm::Target)
63: tvm::codegen::CodeGenC::AddFunction(tvm::GlobalVar const&, tvm::tir::PrimFunc const&)
62: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
61: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
60: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
59: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
58: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
57: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
56: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AttrStmtNode const*)
55: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::AttrStmtNode const*)
54: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
53: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AllocateNode const*)
52: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
51: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AllocateNode const*)
50: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
49: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AllocateNode const*)
48: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
47: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AttrStmtNode const*)
46: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::AttrStmtNode const*)
45: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
44: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::SeqStmtNode const*)
43: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::ForNode const*)
42: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::ForNode const*)
41: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
40: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::SeqStmtNode const*)
39: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::ForNode const*)
38: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::ForNode const*)
37: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
36: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::BufferStoreNode const*)
35: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
34: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
33: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::SubNode const*, std::ostream&)
32: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
31: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
30: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
29: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CastNode const*, std::ostream&)
28: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
27: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
26: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
25: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
24: tvm::codegen::PrintBinaryIntrinsic(tvm::tir::CallNode const*, char const*, std::ostream&, tvm::codegen::CodeGenC*)
23: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
22: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
21: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
20: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
19: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
18: tvm::codegen::PrintBinaryIntrinsic(tvm::tir::CallNode const*, char const*, std::ostream&, tvm::codegen::CodeGenC*)
17: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
16: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
15: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
14: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CastNode const*, std::ostream&)
13: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
12: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
11: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::BufferLoadNode const*, std::ostream&)
10: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
9: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
8: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::AddNode const*, std::ostream&)
7: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
6: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
5: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
4: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::DivNode const*, std::ostream&)
3: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
2: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
1: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
0: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::RampNode const*, std::ostream&)
File "/root/BitBLAS/3rdparty/tvm/src/target/source/codegen_cuda.cc", line 1224
ValueError: Check failed: lanes <= 4 (8 vs. 4) : Ramp of more than 4 lanes is not allowed.
2024-08-05 20:09:57 [BitBLAS:INFO]: Evaluation with config {'block': [1], 'thread': [1], 'rstep': [1024], 'reduce_thread': [128], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:09:57 [BitBLAS:INFO]: Time cost of this config: 0.012 ms
2024-08-05 20:09:57 [BitBLAS:INFO]: Evaluation with config {'block': [2], 'thread': [2], 'rstep': [1024], 'reduce_thread': [64], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:09:57 [BitBLAS:INFO]: Time cost of this config: 0.011 ms
2024-08-05 20:09:57 [BitBLAS:INFO]: Evaluation with config {'block': [8], 'thread': [8], 'rstep': [1024], 'reduce_thread': [16], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:09:57 [BitBLAS:INFO]: Time cost of this config: 0.011 ms
2024-08-05 20:09:57 [BitBLAS:INFO]: Evaluation with config {'block': [4], 'thread': [4], 'rstep': [1024], 'reduce_thread': [32], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:09:57 [BitBLAS:INFO]: Time cost of this config: 0.010 ms
2024-08-05 20:09:57 [BitBLAS:INFO]: Evaluation with config {'block': [16], 'thread': [16], 'rstep': [512], 'reduce_thread': [8], 'vectorize': {'A': 4, 'B_decode': 8}}
2024-08-05 20:09:57 [BitBLAS:INFO]: Time cost of this config: 0.011 ms
2024-08-05 20:09:57 [BitBLAS:INFO]: Evaluation with config {'block': [32], 'thread': [32], 'rstep': [256], 'reduce_thread': [4], 'vectorize': {'A': 2, 'B_decode': 8}}
2024-08-05 20:09:57 [BitBLAS:INFO]: Time cost of this config: 0.016 ms
2024-08-05 20:09:57 [BitBLAS:INFO]: Evaluation with config {'block': [64], 'thread': [64], 'rstep': [128], 'reduce_thread': [2], 'vectorize': {'B_decode': 8}}
2024-08-05 20:09:57 [BitBLAS:INFO]: Time cost of this config: 0.011 ms
Ref output: tensor([[1534., 1538., 1482., ..., 1497., 1506., 1486.]], device='cuda:0',
dtype=torch.float16)
BitBLAS output: tensor([[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0', dtype=torch.float16)
Traceback (most recent call last):
File "/data1/speed_test/new_bitblas_test.py", line 41, in <module>
torch.testing.assert_close(output_tensor, ref_result, rtol=1e-2, atol=1e-0)
File "/opt/python-3.10.12/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!
Mismatched elements: 2048 / 2048 (100.0%)
Greatest absolute difference: 1639.0 at index (0, 1267) (up to 1.0 allowed)
Greatest relative difference: 1.0 at index (0, 0) (up to 0.01 allowed)
root@train-xxxx-5-0:/data1/speed_test#
error on Titan-xp:
root@train-xxxx-4-0:/data1/speed_test# python new_bitblas_test.py
2024-08-05 20:05:57 [BitBLAS:INFO]: Auto detected target: nvidia/nvidia-titan-x
2024-08-05 20:05:57 [BitBLAS:DEBUG]: Cannot find the appropriate index map for tensorcore
/tmp/tmpjy7j7pn4.cu(456): warning #1444-D: function "__shfl_down(__half, unsigned int, int)" (declared at line 1852 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
t0[0] = __shfl_down((red_buf0[0]), (16), (32));
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/tmp/tmpjy7j7pn4.cu(458): warning #1444-D: function "__shfl_down(__half, unsigned int, int)" (declared at line 1852 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
t0[0] = __shfl_down((red_buf0[0]), (8), (32));
^
/tmp/tmpjy7j7pn4.cu(460): warning #1444-D: function "__shfl_down(__half, unsigned int, int)" (declared at line 1852 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
t0[0] = __shfl_down((red_buf0[0]), (4), (32));
^
/tmp/tmpjy7j7pn4.cu(462): warning #1444-D: function "__shfl_down(__half, unsigned int, int)" (declared at line 1852 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
t0[0] = __shfl_down((red_buf0[0]), (2), (32));
^
/tmp/tmpjy7j7pn4.cu(464): warning #1444-D: function "__shfl_down(__half, unsigned int, int)" (declared at line 1852 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
t0[0] = __shfl_down((red_buf0[0]), (1), (32));
^
/tmp/tmpjy7j7pn4.cu(466): warning #1444-D: function "__shfl(__half, int, int)" (declared at line 1840 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl() is deprecated in favor of __shfl_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
red_buf0[0] = __shfl((red_buf0[0]), (0), (32));
^
/tmp/tmpjy7j7pn4.cu(452): warning #550-D: variable "mask" was set but never used
unsigned int mask[1];
^
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Apply config {'block': [4], 'thread': [4], 'rstep': [1024], 'reduce_thread': [32], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Apply config {'block': [2], 'thread': [2], 'rstep': [1024], 'reduce_thread': [64], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Apply config {'block': [1], 'thread': [1], 'rstep': [1024], 'reduce_thread': [128], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Apply config {'block': [8], 'thread': [8], 'rstep': [1024], 'reduce_thread': [16], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Apply config {'block': [32], 'thread': [32], 'rstep': [256], 'reduce_thread': [4], 'vectorize': {'A': 2, 'B_decode': 8}}
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Apply config {'block': [16], 'thread': [16], 'rstep': [512], 'reduce_thread': [8], 'vectorize': {'A': 4, 'B_decode': 8}}
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Apply config {'block': [64], 'thread': [64], 'rstep': [128], 'reduce_thread': [2], 'vectorize': {'B_decode': 8}}
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Apply config {'block': [128], 'thread': [128], 'rstep': [128], 'vectorize': {'B_decode': 8}}
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Warning: block config [128] is not valid for matmul, skip.
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Warning: block config [128] is not valid for matmul, skip.
2024-08-05 20:06:44 [BitBLAS:DEBUG]: LocalBuilder: An exception occurred Traceback (most recent call last):
File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/exec/popen_worker.py", line 87, in main
result = fn(*args, **kwargs)
File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/base/utils.py", line 213, in _build
rt_mod = tvm.build(mod, target=arch.target)
File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/driver/build_module.py", line 297, in build
rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
raise_last_ffi_error()
File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
ValueError: Traceback (most recent call last):
68: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)>::AssignTypedLambda<tvm::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#6}>(tvm::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#6}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMRetValue)
67: tvm::TIRToRuntime(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target const&)
66: tvm::codegen::Build(tvm::IRModule, tvm::Target)
65: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::IRModule, tvm::Target)>::AssignTypedLambda<tvm::runtime::Module (*)(tvm::IRModule, tvm::Target)>(tvm::runtime::Module (*)(tvm::IRModule, tvm::Target), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
64: tvm::codegen::BuildCUDA(tvm::IRModule, tvm::Target)
63: tvm::codegen::CodeGenC::AddFunction(tvm::GlobalVar const&, tvm::tir::PrimFunc const&)
62: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
61: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
60: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
59: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
58: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
57: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
56: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AttrStmtNode const*)
55: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::AttrStmtNode const*)
54: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
53: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AllocateNode const*)
52: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
51: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AllocateNode const*)
50: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
49: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AllocateNode const*)
48: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
47: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AttrStmtNode const*)
46: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::AttrStmtNode const*)
45: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
44: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::SeqStmtNode const*)
43: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::ForNode const*)
42: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::ForNode const*)
41: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
40: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::SeqStmtNode const*)
39: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::ForNode const*)
38: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::ForNode const*)
37: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
36: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::BufferStoreNode const*)
35: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
34: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
33: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::SubNode const*, std::ostream&)
32: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
31: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
30: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
29: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CastNode const*, std::ostream&)
28: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
27: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
26: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
25: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
24: tvm::codegen::PrintBinaryIntrinsic(tvm::tir::CallNode const*, char const*, std::ostream&, tvm::codegen::CodeGenC*)
23: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
22: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
21: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
20: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
19: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
18: tvm::codegen::PrintBinaryIntrinsic(tvm::tir::CallNode const*, char const*, std::ostream&, tvm::codegen::CodeGenC*)
17: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
16: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
15: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
14: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CastNode const*, std::ostream&)
13: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
12: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
11: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::BufferLoadNode const*, std::ostream&)
10: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
9: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
8: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::AddNode const*, std::ostream&)
7: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
6: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
5: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
4: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::DivNode const*, std::ostream&)
3: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
2: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
1: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
0: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::RampNode const*, std::ostream&)
File "/root/BitBLAS/3rdparty/tvm/src/target/source/codegen_cuda.cc", line 1224
ValueError: Check failed: lanes <= 4 (8 vs. 4) : Ramp of more than 4 lanes is not allowed.
2024-08-05 20:06:44 [BitBLAS:INFO]: Evaluation with config {'block': [4], 'thread': [4], 'rstep': [1024], 'reduce_thread': [32], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:06:44 [BitBLAS:INFO]: Time cost of this config: 0.136 ms
2024-08-05 20:06:44 [BitBLAS:INFO]: Evaluation with config {'block': [2], 'thread': [2], 'rstep': [1024], 'reduce_thread': [64], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:06:44 [BitBLAS:INFO]: Time cost of this config: 0.109 ms
2024-08-05 20:06:44 [BitBLAS:INFO]: Evaluation with config {'block': [1], 'thread': [1], 'rstep': [1024], 'reduce_thread': [128], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:06:44 [BitBLAS:INFO]: Time cost of this config: 0.096 ms
2024-08-05 20:06:44 [BitBLAS:INFO]: Evaluation with config {'block': [8], 'thread': [8], 'rstep': [1024], 'reduce_thread': [16], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:06:44 [BitBLAS:INFO]: Time cost of this config: 0.107 ms
2024-08-05 20:06:44 [BitBLAS:INFO]: Evaluation with config {'block': [32], 'thread': [32], 'rstep': [256], 'reduce_thread': [4], 'vectorize': {'A': 2, 'B_decode': 8}}
2024-08-05 20:06:44 [BitBLAS:INFO]: Time cost of this config: 0.111 ms
2024-08-05 20:06:44 [BitBLAS:INFO]: Evaluation with config {'block': [16], 'thread': [16], 'rstep': [512], 'reduce_thread': [8], 'vectorize': {'A': 4, 'B_decode': 8}}
2024-08-05 20:06:44 [BitBLAS:INFO]: Time cost of this config: 0.093 ms
2024-08-05 20:06:44 [BitBLAS:INFO]: Evaluation with config {'block': [64], 'thread': [64], 'rstep': [128], 'reduce_thread': [2], 'vectorize': {'B_decode': 8}}
2024-08-05 20:06:44 [BitBLAS:INFO]: Time cost of this config: 0.142 ms
/tmp/tmpsfswnatl.cu(456): warning #1444-D: function "__shfl_down(__half, unsigned int, int)" (declared at line 1852 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
t0[0] = __shfl_down((red_buf0[0]), (4), (32));
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/tmp/tmpsfswnatl.cu(458): warning #1444-D: function "__shfl_down(__half, unsigned int, int)" (declared at line 1852 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
t0[0] = __shfl_down((red_buf0[0]), (2), (32));
^
/tmp/tmpsfswnatl.cu(460): warning #1444-D: function "__shfl_down(__half, unsigned int, int)" (declared at line 1852 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
t0[0] = __shfl_down((red_buf0[0]), (1), (32));
^
/tmp/tmpsfswnatl.cu(462): warning #1444-D: function "__shfl(__half, int, int)" (declared at line 1840 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl() is deprecated in favor of __shfl_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
red_buf0[0] = __shfl((red_buf0[0]), ((((int)threadIdx.y) * 8)), (32));
^
/tmp/tmpsfswnatl.cu(452): warning #550-D: variable "mask" was set but never used
unsigned int mask[1];
^
Ref output: tensor([[1565., 1554., 1552., ..., 1550., 1512., 1554.]], device='cuda:0',
dtype=torch.float16)
BitBLAS output: tensor([[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0', dtype=torch.float16)
Traceback (most recent call last):
File "/data1/speed_test/new_bitblas_test.py", line 41, in <module>
torch.testing.assert_close(output_tensor, ref_result, rtol=1e-2, atol=1e-0)
File "/opt/python-3.10.12/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!
Mismatched elements: 2048 / 2048 (100.0%)
Greatest absolute difference: 1661.0 at index (0, 1429) (up to 1.0 allowed)
Greatest relative difference: 1.0 at index (0, 0) (up to 0.01 allowed)
Hi @brisker, which CUDA version is in your environment? Could you print the terminal output of the command nvcc --version to show which CUDA version BitBLAS is calling? We have a known issue similar to this with CUDA 12.5.
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
>>> import torch
>>> torch.__version__
'2.1.2+cu121'
@LeiWang1999 @xysmlx I tried CUDA 12.4 and got the same error.
@LeiWang1999 Besides, when I tried CUDA 12.1 it works well, but if I modify the config to
mm = 256
nn = 2048
kk = 1024
matmul_config = bitblas.MatmulConfig(
    M=mm, # M dimension
    N=nn, # N dimension
    K=kk, # K dimension
    A_dtype="int8", # activation A dtype
    W_dtype="int4", # weight W dtype
    accum_dtype="int32", # accumulation dtype
    out_dtype="float32", # output dtype
    layout="nt", # matrix layout, "nt" indicates the layout of A is non-transpose and the layout of W is transpose
    with_bias=False, # bias
    # configs for weight only quantization
    group_size=None, # setting for grouped quantization
    with_scaling=False, # setting for scaling factor
    with_zeros=False, # setting for zeros
    zeros_mode=None, # setting for how to calculating zeros
)
why does it give me:
Traceback (most recent call last):
File "speed_compare.py", line 26, in <module>
matmul = bitblas.Matmul(config=matmul_config)
File "/usr/local/miniconda3/lib/python3.8/site-packages/bitblas/ops/general_matmul/__init__.py", line 243, in __init__
self.dispatch_tir(target, from_database, source_format, enable_tuning)
File "/usr/local/miniconda3/lib/python3.8/site-packages/bitblas/ops/general_matmul/__init__.py", line 294, in dispatch_tir
self.hardware_aware_finetune()
File "/usr/local/miniconda3/lib/python3.8/site-packages/bitblas/ops/operator.py", line 206, in hardware_aware_finetune
self.optimized_func = self.apply_fast_tuning(
File "/usr/local/miniconda3/lib/python3.8/site-packages/bitblas/ops/operator.py", line 178, in apply_fast_tuning
self.pass_context = best.config.pass_context
AttributeError: 'NoneType' object has no attribute 'config'
I am using an A800 GPU.
@brisker, int4xint8 is not fully tested yet; you can use the code below:
matmul_config = bitblas.MatmulConfig(
    M=mm, # M dimension
    N=nn, # N dimension
    K=kk, # K dimension
    A_dtype="int8", # activation A dtype
    W_dtype="int4", # weight W dtype
    accum_dtype="int32", # accumulation dtype
    out_dtype="float32", # output dtype
    layout="nt", # matrix layout, "nt" indicates the layout of A is non-transpose and the layout of W is transpose
    with_bias=False, # bias
    # configs for weight only quantization
    group_size=None, # setting for grouped quantization
    with_scaling=False, # setting for scaling factor
    with_zeros=False, # setting for zeros
    zeros_mode=None, # setting for how to calculating zeros
    fast_decoding=False
)
to disable the fast type conversion; we will fix it soon.
@LeiWang1999
Will this slow down the w4a8 GEMM speed?
Besides, CUDA 12.3, 12.4, and 12.5 all seem to have similar bugs according to my tests.
@LeiWang1999
I just used your w4a8 implementation with fast_decoding=False, and the GEMM time is 0.12 seconds for
m = 256
n = 2048
k = 1024
while the GEMM time is 0.0003 seconds for w4a8 if using this method.
Will fast_decoding=False cause such a tremendous difference for w4a8?
@brisker would you mind providing your benchmark scripts?
@LeiWang1999
Besides, in the code below, if out_dtype="float32" is changed to out_dtype="float16", there is also a bug:
AttributeError: 'NoneType' object has no attribute 'config'
import time
import bitblas
import torch
mm = 256
nn_n = 2048
kk = 1024
act_dtype = "int8"
fast_decoding = False
matmul_config = bitblas.MatmulConfig(
    M=mm, # M dimension
    N=nn_n, # N dimension
    K=kk, # K dimension
    A_dtype=act_dtype, # activation A dtype
    W_dtype="int4", # weight W dtype
    accum_dtype="int32", # accumulation dtype
    out_dtype="float32", # output dtype
    layout="nt", # matrix layout, "nt" indicates the layout of A is non-transpose and the layout of W is transpose
    with_bias=False, # bias
    # configs for weight only quantization
    group_size=None, # setting for grouped quantization
    with_scaling=False, # setting for scaling factor
    with_zeros=False, # setting for zeros
    zeros_mode=None, # setting for how to calculating zeros
    fast_decoding=fast_decoding
)
bitblas_matmul = bitblas.Matmul(config=matmul_config)
with torch.no_grad():
    input = torch.Tensor(mm,nn_n).normal_().cuda().half()
    quant_input, quant_input_scale,dequant_input = dynamic_quant(input)
    scale,scale_extra = get_scale(ori_fc.weight.data,group_size)
    qqq_linear.pack(ori_fc, scale, scale_extra)
    quant_w, w_scale = w4_quant(ori_fc.weight.data,group_size) # not ok, but after bug is fixed, now is ok
# Create input matrices
input_tensor = quant_input.to(torch.int8).cuda() if act_dtype=="int8" else quant_input.half().cuda()
# weight_tensor = ori_fc.weight.data.to(torch.int8).cuda()
# Transform weight tensor to int4 data type
# import pdb;pdb.set_trace()
weight_tensor_int4 = bitblas_matmul.transform_weight(quant_w.T)
# with torch.no_grad():
# out1 = ori_fc(dequant_input.half())
with torch.no_grad():
    time1 = time.time()
    output_tensor = bitblas_matmul(input_tensor, weight_tensor_int4)
    time2 = time.time()
    print(f"bitblas_matmul_time: {time2-time1}")
# print((out1==out2).sum(),out1.numel())
# print(out1)
# print(out2)
# assert(torch.allclose(out1, out2, atol=1e-3, rtol=1e-3))
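For what it's worth, a stripped-down sketch of just the reported out_dtype="float16" failure (hypothetical repro: the external quantization helpers such as dynamic_quant and w4_quant should not be needed, since the error reportedly occurs while constructing the operator):

```python
import bitblas

matmul_config = bitblas.MatmulConfig(
    M=256,
    N=2048,
    K=1024,
    A_dtype="int8",
    W_dtype="int4",
    accum_dtype="int32",
    out_dtype="float16",  # "float32" works; "float16" reportedly triggers the error below
    layout="nt",
    with_bias=False,
    group_size=None,
    with_scaling=False,
    with_zeros=False,
    zeros_mode=None,
    fast_decoding=False,
)
# Reportedly fails during hardware-aware tuning with:
# AttributeError: 'NoneType' object has no attribute 'config'
bitblas_matmul = bitblas.Matmul(config=matmul_config)
```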
@brisker, for small shapes you should run at least 1k iterations and take the average runtime. Moreover, torch.cuda.synchronize() should be called before time2 = time.time(), otherwise the measured time is indeterminate. You can also use bitblas_matmul.profile_latency() to get the kernel performance.
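A minimal sketch of this kind of measurement (assuming bitblas_matmul, input_tensor, and weight_tensor_int4 from the snippet above; warm-up and iteration counts are illustrative):

```python
import time
import torch

def bench(fn, warmup=10, iters=1000):
    # Warm up so one-time setup costs do not pollute the measurement.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    # Wait for all queued kernels to finish before reading the clock.
    torch.cuda.synchronize()
    return (time.time() - start) / iters

avg_s = bench(lambda: bitblas_matmul(input_tensor, weight_tensor_int4))
print(f"average bitblas_matmul time: {avg_s * 1e6:.1f} us")
print(f"profile_latency (ms): {bitblas_matmul.profile_latency()}")
```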
@LeiWang1999
bitblas_matmul.profile_latency() gives me 0.025, but compared to 0.0003 that is still a big difference. I think this difference cannot be due to any other reason than the w4a8 implementation itself.
Will fast_decoding=False cause such a tremendous difference for w4a8? And the out_dtype has to be float32 (otherwise float16 causes an error), which may also have some influence: in other w4a8 pipelines, out_dtype is float16.
@brisker the unit of the profile_latency API is ms.
@LeiWang1999
with torch.no_grad():
    time1 = time.time()
    out1 = w4a8_qserve_linear.forward(input)
    torch.cuda.synchronize()
    time2 = time.time()
    print(f"qserve_linear_time: {time2-time1}")
    time3 = time.time()
    output_tensor = bitblas_matmul(quanti_input_tensor, weight_tensor_int4)
    torch.cuda.synchronize()
    time4 = time.time()
    print(f"bitblas_matmul_time: {time4-time3}")
    print(f"bitblas_matmul_profile_latency--: {bitblas_matmul.profile_latency()}")
The code above gives me:
qserve_linear_time: 0.0004611015319824219
bitblas_matmul_time: 0.11247825622558594
bitblas_matmul_profile_latency--: 0.0241664
(time4-time3) being a lot bigger than (time2-time1) is still very weird to me, since torch.cuda.synchronize() is already added.
Besides, why is (time4-time3) also a lot bigger than bitblas_matmul.profile_latency()? I mean, no matter how we measure the latency, (time4-time3) is always the standard that matters when applying BitBLAS to LLM quantization to accelerate inference.
@brisker , for benchmarking, it is crucial to ensure that the program runs multiple times to minimize the impact of time measurement errors. Single runs can produce inaccurate results.
@LeiWang1999
I tried multiple runs, from 10 to 500, 5k, and 50k. The results: below 1k runs, BitBLAS is consistently slower, but at 5k and 50k runs, BitBLAS is faster, and bitblas_matmul.profile_latency() is also consistently lower.
But in other benchmark test code, 10 or 100 runs are normally enough. Why are your w4a8 ops so sensitive to the number of runs?
I don't understand why you point out that bitblas_matmul.profile_latency() is also consistently lower, because as you mentioned, qserve_linear_time is 0.0004611015319824219 s -> ~461 us, while BitBLAS is 0.0241664 ms -> ~24 us?
Moreover, in my experience, 10 or 100 runs are normally not enough when the kernel runtime is at the microsecond level.
To make it clear, I plan to test them on a real Llama-2-7B model to measure the real w4a8 speed-up. Thanks for your patient replies!
Besides, will fast_decoding=False hurt the BitBLAS w4a8 performance? Currently, fast_decoding=True has bugs.
@LeiWang1999
Hi @brisker. To accurately test the real speed-up, profile_latency() performs correctly in our experience. However, I recommend using nsys or nvprof, or running multiple iterations, to track the actual kernel runtime, as these tools provide more precise and reliable performance metrics for real-world workloads than timing with time.time() alone.
I believe fast_decoding=False doesn't significantly impact int8xint4 for the shapes you're profiling. However, it's always better to apply fast_decoding for optimal performance.
Hi @brisker, we now support INT8xINT4 fast decoding and have fixed the compile issues. If you have any further questions about this issue, feel free to follow up in this thread!
@LeiWang1999 May I ask which part of the w4a8 pipeline you call "fast decoding"?
@brisker Sure. On the Python side, fast_decoding enables a tensorization schedule that performs the decode with LOP3 bit tricks instead of type conversion instructions. https://github.com/microsoft/BitBLAS/blob/60f3e5dedf411361f877de1443b5a596e00d342a/bitblas/gpu/matmul_mma_dequantize.py#L1016
On the CUDA side, you can observe device functions whose names start with decode_.
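For intuition only, a hypothetical PyTorch sketch of what the plain (non-fast) decode path does element-wise, i.e. unpacking int4 values with shifts, masks, and ordinary type conversion; fast_decoding replaces this kind of per-element conversion with LOP3-based bit manipulation in the generated CUDA:

```python
import torch

def decode_int4_naive(packed: torch.Tensor) -> torch.Tensor:
    """Unpack signed int4 values stored two-per-byte (uint8) into int8."""
    low = (packed & 0xF).to(torch.int16)          # lower nibble of each byte
    high = ((packed >> 4) & 0xF).to(torch.int16)  # upper nibble of each byte
    vals = torch.stack([low, high], dim=-1).flatten(-2)
    # Sign-extend 4-bit two's complement: 0..7 stay positive, 8..15 map to -8..-1.
    return ((vals ^ 0x8) - 0x8).to(torch.int8)

packed = torch.tensor([0x21, 0xF8], dtype=torch.uint8)  # nibbles 1, 2, 8, 15
print(decode_int4_naive(packed))  # tensor([ 1,  2, -8, -1], dtype=torch.int8)
```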
@LeiWang1999 We may need a fallback setting for when fast_decoding is not available. This could be implemented by maintaining a list of fast_decoding-supported data types and automatically setting fast_decoding=False when the current data type is not in the list. It would be preferable not to expose performance flags.
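A rough Python sketch of this fallback idea (FAST_DECODING_SUPPORTED and the (A_dtype, W_dtype) keying are hypothetical, not BitBLAS internals):

```python
# Hypothetical allow-list of (A_dtype, W_dtype) pairs known to work with fast decoding.
# Entries are illustrative; e.g. ("int8", "int4") would be added once it is supported.
FAST_DECODING_SUPPORTED = {
    ("float16", "int4"),
    ("float16", "uint4"),
}

def resolve_fast_decoding(A_dtype: str, W_dtype: str, requested: bool = True) -> bool:
    """Silently fall back to the plain decode path for unsupported dtype pairs."""
    if requested and (A_dtype, W_dtype) not in FAST_DECODING_SUPPORTED:
        return False
    return requested

# Example: with int8 activations and int4 weights this returns False,
# so the user never has to set the performance flag explicitly.
print(resolve_fast_decoding("int8", "int4"))  # False
```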
Python 3.10, CUDA 12.1.
I just pip installed bitblas and ran:
python -c "import bitblas; print(bitblas.__version__)"
and it gives me: 0.0.1.dev13
Then I ran this basic code:
and the weird thing is that the running result gives me: