microsoft / BitBLAS

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
MIT License

running issues #131

Closed · brisker closed this issue 1 month ago

brisker commented 1 month ago

Python 3.10, CUDA 12.1

I just ran `pip install bitblas` and then `python -c "import bitblas; print(bitblas.__version__)"`, which prints: 0.0.1.dev13

Then I ran this basic code:

import bitblas
import torch

matmul_config = bitblas.MatmulConfig(
    M=1,  # M dimension
    N=2048,  # N dimension
    K=1024,  # K dimension
    A_dtype="float16",  # activation A dtype
    W_dtype="int4",  # weight W dtype
    accum_dtype="float16",  # accumulation dtype
    out_dtype="float16",  # output dtype
    layout="nt",  # matrix layout, "nt" indicates the layout of A is non-transpose and the layout of W is transpose
    with_bias=False,  # bias
    # configs for weight only quantization
    group_size=None,  # setting for grouped quantization
    with_scaling=False,  # setting for scaling factor
    with_zeros=False,  # setting for zeros
    zeros_mode=None,  # setting for how to calculate zeros
)

matmul = bitblas.Matmul(config=matmul_config)

input_tensor = torch.rand((1, 1024), dtype=torch.float16).cuda()
weight_tensor = torch.randint(0, 7, (2048, 1024), dtype=torch.int8).cuda()

weight_tensor_int4 = matmul.transform_weight(weight_tensor)

output_tensor = matmul(input_tensor, weight_tensor_int4)

ref_result = torch.matmul(input_tensor, weight_tensor.t().to(torch.float16))

print("Ref output:", ref_result)
print("BitBLAS output:", output_tensor)
torch.testing.assert_close(output_tensor, ref_result, rtol=1e-2, atol=1e-0)

The weird thing is that running it gives me:

Ref output: tensor([[1494., 1461., 1529.,  ..., 1508., 1525., 1446.]], device='cuda:0',
       dtype=torch.float16)
BitBLAS output: tensor([[0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0', dtype=torch.float16)
Traceback (most recent call last):
  File "/data1/speed_test/new_bitblas_test.py", line 41, in <module>
    torch.testing.assert_close(output_tensor, ref_result, rtol=1e-2, atol=1e-0)
  File "/opt/python-3.10.12/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 2048 / 2048 (100.0%)
Greatest absolute difference: 1662.0 at index (0, 235) (up to 1.0 allowed)
Greatest relative difference: 1.0 at index (0, 0) (up to 0.01 allowed)

### Tasks
- [ ] INT8xINT4 Fast Decoding
- [ ] warp reduce API update
xysmlx commented 1 month ago

Hi, could you insert the debug statement `bitblas.set_log_level("Debug")` as the 3rd line, right after `import torch`, and share the log? Thank you.
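For clarity, the top of the reproduction script would then look like this (a sketch; the rest of the script stays unchanged):

import bitblas
import torch

# emit debug-level logs while BitBLAS builds and tunes the kernel
bitblas.set_log_level("Debug")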

LeiWang1999 commented 1 month ago

This works on my A100:

python test_issue_131.py
Ref output: tensor([[1654., 1572., 1550.,  ..., 1519., 1561., 1584.]], device='cuda:0',
       dtype=torch.float16)
BitBLAS output: tensor([[1654., 1572., 1550.,  ..., 1520., 1561., 1585.]], device='cuda:0',
       dtype=torch.float16)
brisker commented 1 month ago

@xysmlx @LeiWang1999 I tested on an A800 GPU and a Titan Xp GPU. Below are the logs for both.

Error on the A800:

root@train-xxxx-5-0:/data1/speed_test# python new_bitblas_test.py
TVM target not found. Please set the TVM target environment variable using `export TVM_TARGET=<target>`, where <target> is one of the available targets can be found in the output of `tools/get_available_targets.py`.
2024-08-05 20:09:25 [BitBLAS:WARNING]: TVM target not found. Please set the TVM target environment variable using `export TVM_TARGET=<target>`, where <target> is one of the available targets can be found in the output of `tools/get_available_targets.py`.
2024-08-05 20:09:25 [BitBLAS:INFO]: Auto detected target: cuda
2024-08-05 20:09:26 [BitBLAS:DEBUG]: Cannot find the appropriate index map for tensorcore
2024-08-05 20:09:51 [BitBLAS:DEBUG]: Cannot find the appropriate index map for tensorcore
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Apply config {'block': [1], 'thread': [1], 'rstep': [1024], 'reduce_thread': [128], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Apply config {'block': [2], 'thread': [2], 'rstep': [1024], 'reduce_thread': [64], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Apply config {'block': [8], 'thread': [8], 'rstep': [1024], 'reduce_thread': [16], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Apply config {'block': [4], 'thread': [4], 'rstep': [1024], 'reduce_thread': [32], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Apply config {'block': [16], 'thread': [16], 'rstep': [512], 'reduce_thread': [8], 'vectorize': {'A': 4, 'B_decode': 8}}
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Apply config {'block': [32], 'thread': [32], 'rstep': [256], 'reduce_thread': [4], 'vectorize': {'A': 2, 'B_decode': 8}}
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Apply config {'block': [64], 'thread': [64], 'rstep': [128], 'reduce_thread': [2], 'vectorize': {'B_decode': 8}}
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Apply config {'block': [128], 'thread': [128], 'rstep': [128], 'vectorize': {'B_decode': 8}}
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Warning: block config [128] is not valid for matmul, skip.
2024-08-05 20:09:53 [BitBLAS:DEBUG]: Warning: block config [128] is not valid for matmul, skip.
WARNING:bitblas.utils.target_detector:TVM target not found. Please set the TVM target environment variable using `export TVM_TARGET=<target>`, where <target> is one of the available targets can be found in the output of `tools/get_available_targets.py`.
WARNING:bitblas.utils.target_detector:TVM target not found. Please set the TVM target environment variable using `export TVM_TARGET=<target>`, where <target> is one of the available targets can be found in the output of `tools/get_available_targets.py`.
WARNING:bitblas.utils.target_detector:TVM target not found. Please set the TVM target environment variable using `export TVM_TARGET=<target>`, where <target> is one of the available targets can be found in the output of `tools/get_available_targets.py`.
WARNING:bitblas.utils.target_detector:TVM target not found. Please set the TVM target environment variable using `export TVM_TARGET=<target>`, where <target> is one of the available targets can be found in the output of `tools/get_available_targets.py`.
WARNING:bitblas.utils.target_detector:TVM target not found. Please set the TVM target environment variable using `export TVM_TARGET=<target>`, where <target> is one of the available targets can be found in the output of `tools/get_available_targets.py`.
WARNING:bitblas.utils.target_detector:TVM target not found. Please set the TVM target environment variable using `export TVM_TARGET=<target>`, where <target> is one of the available targets can be found in the output of `tools/get_available_targets.py`.
WARNING:bitblas.utils.target_detector:TVM target not found. Please set the TVM target environment variable using `export TVM_TARGET=<target>`, where <target> is one of the available targets can be found in the output of `tools/get_available_targets.py`.
WARNING:bitblas.utils.target_detector:TVM target not found. Please set the TVM target environment variable using `export TVM_TARGET=<target>`, where <target> is one of the available targets can be found in the output of `tools/get_available_targets.py`.
2024-08-05 20:09:57 [BitBLAS:DEBUG]: LocalBuilder: An exception occurred Traceback (most recent call last):
  File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/exec/popen_worker.py", line 87, in main
    result = fn(*args, **kwargs)
  File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/base/utils.py", line 213, in _build
    rt_mod = tvm.build(mod, target=arch.target)
  File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/driver/build_module.py", line 297, in build
    rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
  File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
ValueError: Traceback (most recent call last):
  68: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)>::AssignTypedLambda<tvm::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#6}>(tvm::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#6}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMRetValue)
  67: tvm::TIRToRuntime(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target const&)
  66: tvm::codegen::Build(tvm::IRModule, tvm::Target)
  65: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::IRModule, tvm::Target)>::AssignTypedLambda<tvm::runtime::Module (*)(tvm::IRModule, tvm::Target)>(tvm::runtime::Module (*)(tvm::IRModule, tvm::Target), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  64: tvm::codegen::BuildCUDA(tvm::IRModule, tvm::Target)
  63: tvm::codegen::CodeGenC::AddFunction(tvm::GlobalVar const&, tvm::tir::PrimFunc const&)
  62: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
  61: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
  60: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
  59: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
  58: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
  57: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
  56: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AttrStmtNode const*)
  55: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::AttrStmtNode const*)
  54: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
  53: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AllocateNode const*)
  52: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
  51: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AllocateNode const*)
  50: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
  49: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AllocateNode const*)
  48: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
  47: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AttrStmtNode const*)
  46: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::AttrStmtNode const*)
  45: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
  44: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::SeqStmtNode const*)
  43: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::ForNode const*)
  42: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::ForNode const*)
  41: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
  40: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::SeqStmtNode const*)
  39: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::ForNode const*)
  38: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::ForNode const*)
  37: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
  36: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::BufferStoreNode const*)
  35: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  34: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  33: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::SubNode const*, std::ostream&)
  32: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
  31: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  30: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  29: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CastNode const*, std::ostream&)
  28: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  27: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  26: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
  25: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
  24: tvm::codegen::PrintBinaryIntrinsic(tvm::tir::CallNode const*, char const*, std::ostream&, tvm::codegen::CodeGenC*)
  23: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
  22: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  21: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  20: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
  19: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
  18: tvm::codegen::PrintBinaryIntrinsic(tvm::tir::CallNode const*, char const*, std::ostream&, tvm::codegen::CodeGenC*)
  17: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
  16: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  15: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  14: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CastNode const*, std::ostream&)
  13: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  12: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  11: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::BufferLoadNode const*, std::ostream&)
  10: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  9: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  8: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::AddNode const*, std::ostream&)
  7: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
  6: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  5: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  4: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::DivNode const*, std::ostream&)
  3: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
  2: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  1: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  0: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::RampNode const*, std::ostream&)
  File "/root/BitBLAS/3rdparty/tvm/src/target/source/codegen_cuda.cc", line 1224
ValueError: Check failed: lanes <= 4 (8 vs. 4) : Ramp of more than 4 lanes is not allowed.

2024-08-05 20:09:57 [BitBLAS:INFO]: Evaluation with config {'block': [1], 'thread': [1], 'rstep': [1024], 'reduce_thread': [128], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:09:57 [BitBLAS:INFO]: Time cost of this config: 0.012 ms
2024-08-05 20:09:57 [BitBLAS:INFO]: Evaluation with config {'block': [2], 'thread': [2], 'rstep': [1024], 'reduce_thread': [64], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:09:57 [BitBLAS:INFO]: Time cost of this config: 0.011 ms
2024-08-05 20:09:57 [BitBLAS:INFO]: Evaluation with config {'block': [8], 'thread': [8], 'rstep': [1024], 'reduce_thread': [16], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:09:57 [BitBLAS:INFO]: Time cost of this config: 0.011 ms
2024-08-05 20:09:57 [BitBLAS:INFO]: Evaluation with config {'block': [4], 'thread': [4], 'rstep': [1024], 'reduce_thread': [32], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:09:57 [BitBLAS:INFO]: Time cost of this config: 0.010 ms
2024-08-05 20:09:57 [BitBLAS:INFO]: Evaluation with config {'block': [16], 'thread': [16], 'rstep': [512], 'reduce_thread': [8], 'vectorize': {'A': 4, 'B_decode': 8}}
2024-08-05 20:09:57 [BitBLAS:INFO]: Time cost of this config: 0.011 ms
2024-08-05 20:09:57 [BitBLAS:INFO]: Evaluation with config {'block': [32], 'thread': [32], 'rstep': [256], 'reduce_thread': [4], 'vectorize': {'A': 2, 'B_decode': 8}}
2024-08-05 20:09:57 [BitBLAS:INFO]: Time cost of this config: 0.016 ms
2024-08-05 20:09:57 [BitBLAS:INFO]: Evaluation with config {'block': [64], 'thread': [64], 'rstep': [128], 'reduce_thread': [2], 'vectorize': {'B_decode': 8}}
2024-08-05 20:09:57 [BitBLAS:INFO]: Time cost of this config: 0.011 ms
Ref output: tensor([[1534., 1538., 1482.,  ..., 1497., 1506., 1486.]], device='cuda:0',
       dtype=torch.float16)
BitBLAS output: tensor([[0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0', dtype=torch.float16)
Traceback (most recent call last):
  File "/data1/speed_test/new_bitblas_test.py", line 41, in <module>
    torch.testing.assert_close(output_tensor, ref_result, rtol=1e-2, atol=1e-0)
  File "/opt/python-3.10.12/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 2048 / 2048 (100.0%)
Greatest absolute difference: 1639.0 at index (0, 1267) (up to 1.0 allowed)
Greatest relative difference: 1.0 at index (0, 0) (up to 0.01 allowed)
root@train-xxxx-5-0:/data1/speed_test#

Error on the Titan Xp:

root@train-xxxx-4-0:/data1/speed_test# python new_bitblas_test.py
2024-08-05 20:05:57 [BitBLAS:INFO]: Auto detected target: nvidia/nvidia-titan-x
2024-08-05 20:05:57 [BitBLAS:DEBUG]: Cannot find the appropriate index map for tensorcore
/tmp/tmpjy7j7pn4.cu(456): warning #1444-D: function "__shfl_down(__half, unsigned int, int)" (declared at line 1852 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
    t0[0] = __shfl_down((red_buf0[0]), (16), (32));
            ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/tmp/tmpjy7j7pn4.cu(458): warning #1444-D: function "__shfl_down(__half, unsigned int, int)" (declared at line 1852 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
    t0[0] = __shfl_down((red_buf0[0]), (8), (32));
            ^

/tmp/tmpjy7j7pn4.cu(460): warning #1444-D: function "__shfl_down(__half, unsigned int, int)" (declared at line 1852 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
    t0[0] = __shfl_down((red_buf0[0]), (4), (32));
            ^

/tmp/tmpjy7j7pn4.cu(462): warning #1444-D: function "__shfl_down(__half, unsigned int, int)" (declared at line 1852 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
    t0[0] = __shfl_down((red_buf0[0]), (2), (32));
            ^

/tmp/tmpjy7j7pn4.cu(464): warning #1444-D: function "__shfl_down(__half, unsigned int, int)" (declared at line 1852 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
    t0[0] = __shfl_down((red_buf0[0]), (1), (32));
            ^

/tmp/tmpjy7j7pn4.cu(466): warning #1444-D: function "__shfl(__half, int, int)" (declared at line 1840 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl() is deprecated in favor of __shfl_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
    red_buf0[0] = __shfl((red_buf0[0]), (0), (32));
                  ^

/tmp/tmpjy7j7pn4.cu(452): warning #550-D: variable "mask" was set but never used
    unsigned int mask[1];
                 ^

2024-08-05 20:06:35 [BitBLAS:DEBUG]: Apply config {'block': [4], 'thread': [4], 'rstep': [1024], 'reduce_thread': [32], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Apply config {'block': [2], 'thread': [2], 'rstep': [1024], 'reduce_thread': [64], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Apply config {'block': [1], 'thread': [1], 'rstep': [1024], 'reduce_thread': [128], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Apply config {'block': [8], 'thread': [8], 'rstep': [1024], 'reduce_thread': [16], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Apply config {'block': [32], 'thread': [32], 'rstep': [256], 'reduce_thread': [4], 'vectorize': {'A': 2, 'B_decode': 8}}
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Apply config {'block': [16], 'thread': [16], 'rstep': [512], 'reduce_thread': [8], 'vectorize': {'A': 4, 'B_decode': 8}}
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Apply config {'block': [64], 'thread': [64], 'rstep': [128], 'reduce_thread': [2], 'vectorize': {'B_decode': 8}}
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Apply config {'block': [128], 'thread': [128], 'rstep': [128], 'vectorize': {'B_decode': 8}}
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Warning: block config [128] is not valid for matmul, skip.
2024-08-05 20:06:35 [BitBLAS:DEBUG]: Warning: block config [128] is not valid for matmul, skip.
2024-08-05 20:06:44 [BitBLAS:DEBUG]: LocalBuilder: An exception occurred Traceback (most recent call last):
  File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/exec/popen_worker.py", line 87, in main
    result = fn(*args, **kwargs)
  File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/base/utils.py", line 213, in _build
    rt_mod = tvm.build(mod, target=arch.target)
  File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/driver/build_module.py", line 297, in build
    rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
  File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "/opt/python-3.10.12/lib/python3.10/site-packages/bitblas/3rdparty/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
ValueError: Traceback (most recent call last):
  68: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)>::AssignTypedLambda<tvm::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#6}>(tvm::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#6}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMRetValue)
  67: tvm::TIRToRuntime(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target const&)
  66: tvm::codegen::Build(tvm::IRModule, tvm::Target)
  65: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::IRModule, tvm::Target)>::AssignTypedLambda<tvm::runtime::Module (*)(tvm::IRModule, tvm::Target)>(tvm::runtime::Module (*)(tvm::IRModule, tvm::Target), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  64: tvm::codegen::BuildCUDA(tvm::IRModule, tvm::Target)
  63: tvm::codegen::CodeGenC::AddFunction(tvm::GlobalVar const&, tvm::tir::PrimFunc const&)
  62: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
  61: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
  60: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
  59: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
  58: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
  57: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::DeclBufferNode const*)
  56: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AttrStmtNode const*)
  55: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::AttrStmtNode const*)
  54: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
  53: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AllocateNode const*)
  52: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
  51: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AllocateNode const*)
  50: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
  49: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AllocateNode const*)
  48: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
  47: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::AttrStmtNode const*)
  46: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::AttrStmtNode const*)
  45: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
  44: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::SeqStmtNode const*)
  43: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::ForNode const*)
  42: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::ForNode const*)
  41: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
  40: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::SeqStmtNode const*)
  39: tvm::codegen::CodeGenCUDA::VisitStmt_(tvm::tir::ForNode const*)
  38: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::ForNode const*)
  37: tvm::tir::StmtFunctor<void (tvm::tir::Stmt const&)>::VisitStmt(tvm::tir::Stmt const&)
  36: tvm::codegen::CodeGenC::VisitStmt_(tvm::tir::BufferStoreNode const*)
  35: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  34: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  33: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::SubNode const*, std::ostream&)
  32: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
  31: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  30: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  29: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CastNode const*, std::ostream&)
  28: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  27: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  26: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
  25: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
  24: tvm::codegen::PrintBinaryIntrinsic(tvm::tir::CallNode const*, char const*, std::ostream&, tvm::codegen::CodeGenC*)
  23: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
  22: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  21: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  20: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
  19: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::CallNode const*, std::ostream&)
  18: tvm::codegen::PrintBinaryIntrinsic(tvm::tir::CallNode const*, char const*, std::ostream&, tvm::codegen::CodeGenC*)
  17: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
  16: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  15: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  14: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::CastNode const*, std::ostream&)
  13: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  12: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  11: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::BufferLoadNode const*, std::ostream&)
  10: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  9: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  8: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::AddNode const*, std::ostream&)
  7: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
  6: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  5: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  4: tvm::codegen::CodeGenC::VisitExpr_(tvm::tir::DivNode const*, std::ostream&)
  3: tvm::codegen::CodeGenCUDA::PrintVecBinaryOp(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::DataType, tvm::PrimExpr, tvm::PrimExpr, std::ostream&)
  2: tvm::codegen::CodeGenC::PrintExpr[abi:cxx11](tvm::PrimExpr const&)
  1: tvm::codegen::CodeGenC::PrintExpr(tvm::PrimExpr const&, std::ostream&)
  0: tvm::codegen::CodeGenCUDA::VisitExpr_(tvm::tir::RampNode const*, std::ostream&)
  File "/root/BitBLAS/3rdparty/tvm/src/target/source/codegen_cuda.cc", line 1224
ValueError: Check failed: lanes <= 4 (8 vs. 4) : Ramp of more than 4 lanes is not allowed.

2024-08-05 20:06:44 [BitBLAS:INFO]: Evaluation with config {'block': [4], 'thread': [4], 'rstep': [1024], 'reduce_thread': [32], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:06:44 [BitBLAS:INFO]: Time cost of this config: 0.136 ms
2024-08-05 20:06:44 [BitBLAS:INFO]: Evaluation with config {'block': [2], 'thread': [2], 'rstep': [1024], 'reduce_thread': [64], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:06:44 [BitBLAS:INFO]: Time cost of this config: 0.109 ms
2024-08-05 20:06:44 [BitBLAS:INFO]: Evaluation with config {'block': [1], 'thread': [1], 'rstep': [1024], 'reduce_thread': [128], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:06:44 [BitBLAS:INFO]: Time cost of this config: 0.096 ms
2024-08-05 20:06:44 [BitBLAS:INFO]: Evaluation with config {'block': [8], 'thread': [8], 'rstep': [1024], 'reduce_thread': [16], 'vectorize': {'A': 8, 'B_decode': 8}}
2024-08-05 20:06:44 [BitBLAS:INFO]: Time cost of this config: 0.107 ms
2024-08-05 20:06:44 [BitBLAS:INFO]: Evaluation with config {'block': [32], 'thread': [32], 'rstep': [256], 'reduce_thread': [4], 'vectorize': {'A': 2, 'B_decode': 8}}
2024-08-05 20:06:44 [BitBLAS:INFO]: Time cost of this config: 0.111 ms
2024-08-05 20:06:44 [BitBLAS:INFO]: Evaluation with config {'block': [16], 'thread': [16], 'rstep': [512], 'reduce_thread': [8], 'vectorize': {'A': 4, 'B_decode': 8}}
2024-08-05 20:06:44 [BitBLAS:INFO]: Time cost of this config: 0.093 ms
2024-08-05 20:06:44 [BitBLAS:INFO]: Evaluation with config {'block': [64], 'thread': [64], 'rstep': [128], 'reduce_thread': [2], 'vectorize': {'B_decode': 8}}
2024-08-05 20:06:44 [BitBLAS:INFO]: Time cost of this config: 0.142 ms
/tmp/tmpsfswnatl.cu(456): warning #1444-D: function "__shfl_down(__half, unsigned int, int)" (declared at line 1852 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
    t0[0] = __shfl_down((red_buf0[0]), (4), (32));
            ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/tmp/tmpsfswnatl.cu(458): warning #1444-D: function "__shfl_down(__half, unsigned int, int)" (declared at line 1852 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
    t0[0] = __shfl_down((red_buf0[0]), (2), (32));
            ^

/tmp/tmpsfswnatl.cu(460): warning #1444-D: function "__shfl_down(__half, unsigned int, int)" (declared at line 1852 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl_down() is deprecated in favor of __shfl_down_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
    t0[0] = __shfl_down((red_buf0[0]), (1), (32));
            ^

/tmp/tmpsfswnatl.cu(462): warning #1444-D: function "__shfl(__half, int, int)" (declared at line 1840 of /usr/local/cuda/bin/../targets/x86_64-linux/include/cuda_fp16.hpp) was declared deprecated ("__shfl() is deprecated in favor of __shfl_sync() and may be removed in a future release (Use -Wno-deprecated-declarations to suppress this warning).")
    red_buf0[0] = __shfl((red_buf0[0]), ((((int)threadIdx.y) * 8)), (32));
                  ^

/tmp/tmpsfswnatl.cu(452): warning #550-D: variable "mask" was set but never used
    unsigned int mask[1];
                 ^

Ref output: tensor([[1565., 1554., 1552.,  ..., 1550., 1512., 1554.]], device='cuda:0',
       dtype=torch.float16)
BitBLAS output: tensor([[0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0', dtype=torch.float16)
Traceback (most recent call last):
  File "/data1/speed_test/new_bitblas_test.py", line 41, in <module>
    torch.testing.assert_close(output_tensor, ref_result, rtol=1e-2, atol=1e-0)
  File "/opt/python-3.10.12/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 2048 / 2048 (100.0%)
Greatest absolute difference: 1661.0 at index (0, 1429) (up to 1.0 allowed)
Greatest relative difference: 1.0 at index (0, 0) (up to 0.01 allowed)
LeiWang1999 commented 1 month ago

Hi @brisker , which CUDA version is in your environment?

xysmlx commented 1 month ago

Could you share the terminal output of the command `nvcc --version` so we can see which CUDA version BitBLAS is calling?

We have a known issue similar to this with CUDA 12.5.

brisker commented 1 month ago
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
>>> import torch
>>> torch.__version__
'2.1.2+cu121'
brisker commented 1 month ago

@LeiWang1999 @xysmlx I tried CUDA 12.4 and got the same error.

brisker commented 1 month ago

@LeiWang1999 Besides, when I tried CUDA 12.1 it works well, but if I modify the config to

mm = 256
nn = 2048
kk = 1024
matmul_config = bitblas.MatmulConfig(
    M=mm,  # M dimension
    N=nn,  # N dimension
    K=kk,  # K dimension
    A_dtype="int8",  # activation A dtype
    W_dtype="int4",  # weight W dtype
    accum_dtype="int32",  # accumulation dtype
    out_dtype="float32",  # output dtype
    layout="nt",  # matrix layout, "nt" indicates the layout of A is non-transpose and the layout of W is transpose
    with_bias=False,  # bias
    # configs for weight only quantization
    group_size=None,  # setting for grouped quantization
    with_scaling=False,  # setting for scaling factor
    with_zeros=False,  # setting for zeros
    zeros_mode=None,  # setting for how to calculating zeros
) 

why does it give me:

Traceback (most recent call last):
  File "speed_compare.py", line 26, in <module>
    matmul = bitblas.Matmul(config=matmul_config)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/bitblas/ops/general_matmul/__init__.py", line 243, in __init__
    self.dispatch_tir(target, from_database, source_format, enable_tuning)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/bitblas/ops/general_matmul/__init__.py", line 294, in dispatch_tir
    self.hardware_aware_finetune()
  File "/usr/local/miniconda3/lib/python3.8/site-packages/bitblas/ops/operator.py", line 206, in hardware_aware_finetune
    self.optimized_func = self.apply_fast_tuning(
  File "/usr/local/miniconda3/lib/python3.8/site-packages/bitblas/ops/operator.py", line 178, in apply_fast_tuning
    self.pass_context = best.config.pass_context
AttributeError: 'NoneType' object has no attribute 'config'

I am using an A800 GPU.

LeiWang1999 commented 1 month ago

@LeiWang1999 Besides, when I tried CUDA 12.1 it works well, but if I modify the config to

mm = 256
nn = 2048
kk = 1024
matmul_config = bitblas.MatmulConfig(
    M=mm,  # M dimension
    N=nn,  # N dimension
    K=kk,  # K dimension
    A_dtype="int8",  # activation A dtype
    W_dtype="int4",  # weight W dtype
    accum_dtype="int32",  # accumulation dtype
    out_dtype="float32",  # output dtype
    layout="nt",  # matrix layout, "nt" indicates the layout of A is non-transpose and the layout of W is transpose
    with_bias=False,  # bias
    # configs for weight only quantization
    group_size=None,  # setting for grouped quantization
    with_scaling=False,  # setting for scaling factor
    with_zeros=False,  # setting for zeros
    zeros_mode=None,  # setting for how to calculating zeros
) 

why does it give me:

Traceback (most recent call last):
  File "speed_compare.py", line 26, in <module>
    matmul = bitblas.Matmul(config=matmul_config)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/bitblas/ops/general_matmul/__init__.py", line 243, in __init__
    self.dispatch_tir(target, from_database, source_format, enable_tuning)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/bitblas/ops/general_matmul/__init__.py", line 294, in dispatch_tir
    self.hardware_aware_finetune()
  File "/usr/local/miniconda3/lib/python3.8/site-packages/bitblas/ops/operator.py", line 206, in hardware_aware_finetune
    self.optimized_func = self.apply_fast_tuning(
  File "/usr/local/miniconda3/lib/python3.8/site-packages/bitblas/ops/operator.py", line 178, in apply_fast_tuning
    self.pass_context = best.config.pass_context
AttributeError: 'NoneType' object has no attribute 'config'

I am using an A800 GPU.

@brisker , the int4xint8 path is not fully tested yet; you can use the code below:

matmul_config = bitblas.MatmulConfig(
    M=mm,  # M dimension
    N=nn,  # N dimension
    K=kk,  # K dimension
    A_dtype="int8",  # activation A dtype
    W_dtype="int4",  # weight W dtype
    accum_dtype="int32",  # accumulation dtype
    out_dtype="float32",  # output dtype
    layout="nt",  # matrix layout, "nt" indicates the layout of A is non-transpose and the layout of W is transpose
    with_bias=False,  # bias
    # configs for weight only quantization
    group_size=None,  # setting for grouped quantization
    with_scaling=False,  # setting for scaling factor
    with_zeros=False,  # setting for zeros
    zeros_mode=None,  # setting for how to calculating zeros
    fast_decoding=False
) 

to disable the fast type conversion. We will fix it soon.
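For completeness, a minimal sketch of using that config end to end, adapting the FP16 example at the top of the thread (shapes assume mm=256, nn=2048, kk=1024):

import bitblas
import torch

matmul = bitblas.Matmul(config=matmul_config)

# int8 activations and int4-range weights (stored as int8 before packing)
input_tensor = torch.randint(-8, 8, (256, 1024), dtype=torch.int8).cuda()
weight_tensor = torch.randint(-8, 8, (2048, 1024), dtype=torch.int8).cuda()
weight_tensor_int4 = matmul.transform_weight(weight_tensor)

# float32 output, per out_dtype in the config above
output_tensor = matmul(input_tensor, weight_tensor_int4)
ref_result = torch.matmul(input_tensor.float(), weight_tensor.t().float())
print("Ref output:", ref_result)
print("BitBLAS output:", output_tensor)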

brisker commented 1 month ago

@LeiWang1999

@LeiWang1999 Besides, when I tried CUDA 12.1 it works well, but if I modify the config to

mm = 256
nn = 2048
kk = 1024
matmul_config = bitblas.MatmulConfig(
    M=mm,  # M dimension
    N=nn,  # N dimension
    K=kk,  # K dimension
    A_dtype="int8",  # activation A dtype
    W_dtype="int4",  # weight W dtype
    accum_dtype="int32",  # accumulation dtype
    out_dtype="float32",  # output dtype
    layout="nt",  # matrix layout, "nt" indicates the layout of A is non-transpose and the layout of W is transpose
    with_bias=False,  # bias
    # configs for weight only quantization
    group_size=None,  # setting for grouped quantization
    with_scaling=False,  # setting for scaling factor
    with_zeros=False,  # setting for zeros
    zeros_mode=None,  # setting for how to calculating zeros
) 

why does it give me:

Traceback (most recent call last):
  File "speed_compare.py", line 26, in <module>
    matmul = bitblas.Matmul(config=matmul_config)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/bitblas/ops/general_matmul/__init__.py", line 243, in __init__
    self.dispatch_tir(target, from_database, source_format, enable_tuning)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/bitblas/ops/general_matmul/__init__.py", line 294, in dispatch_tir
    self.hardware_aware_finetune()
  File "/usr/local/miniconda3/lib/python3.8/site-packages/bitblas/ops/operator.py", line 206, in hardware_aware_finetune
    self.optimized_func = self.apply_fast_tuning(
  File "/usr/local/miniconda3/lib/python3.8/site-packages/bitblas/ops/operator.py", line 178, in apply_fast_tuning
    self.pass_context = best.config.pass_context
AttributeError: 'NoneType' object has no attribute 'config'

I am using an A800 GPU.

@brisker , the int4xint8 path is not fully tested yet; you can use the config above with fast_decoding=False to disable the fast type conversion. We will fix it soon.

Will this slow down the W4A8 GEMM?

Besides, CUDA 12.3, 12.4, and 12.5 all seem to have similar bugs according to my tests.

brisker commented 1 month ago

@LeiWang1999 I just used your W4A8 with fast_decoding=False, and the GEMM time is 0.12 seconds for

m = 256
n = 2048
k = 1024

while the GEMM time is 0.0003 seconds for W4A8 when using this method.

Will fast_decoding=False cause such a tremendous difference for W4A8?

LeiWang1999 commented 1 month ago

@brisker would you mind providing your benchmark scripts?

brisker commented 1 month ago

@LeiWang1999 Besides, in the code below, if out_dtype="float32" is changed to out_dtype="float16", there is also a bug: AttributeError: 'NoneType' object has no attribute 'config'


import time
import bitblas 
import torch
mm = 256
nn_n = 2048
kk = 1024
act_dtype = "int8"
fast_decoding = False
matmul_config = bitblas.MatmulConfig(
    M=mm,  # M dimension
    N=nn_n,  # N dimension
    K=kk,  # K dimension
    A_dtype=act_dtype,  # activation A dtype
    W_dtype="int4",  # weight W dtype
    accum_dtype="int32",  # accumulation dtype
    out_dtype="float32",  # output dtype
    layout="nt",  # matrix layout, "nt" indicates the layout of A is non-transpose and the layout of W is transpose
    with_bias=False,  # bias
    # configs for weight only quantization
    group_size=None,  # setting for grouped quantization
    with_scaling=False,  # setting for scaling factor
    with_zeros=False,  # setting for zeros
    zeros_mode=None,  # setting for how to calculating zeros
    fast_decoding=fast_decoding
)

bitblas_matmul = bitblas.Matmul(config=matmul_config)

with torch.no_grad():
    input = torch.Tensor(mm,nn_n).normal_().cuda().half()
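    # NOTE: dynamic_quant, get_scale, w4_quant, ori_fc, qqq_linear and group_size
    # are helpers/objects defined elsewhere in this script (not shown here)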
    quant_input, quant_input_scale,dequant_input = dynamic_quant(input)

    scale,scale_extra = get_scale(ori_fc.weight.data,group_size)
    qqq_linear.pack(ori_fc, scale, scale_extra)
    quant_w, w_scale = w4_quant(ori_fc.weight.data,group_size)   # not ok, but after bug is fixed, now is ok

    # Create input matrices
    input_tensor = quant_input.to(torch.int8).cuda() if act_dtype=="int8" else quant_input.half().cuda()
    # weight_tensor = ori_fc.weight.data.to(torch.int8).cuda()

    # Transform weight tensor to int4 data type
    # import pdb;pdb.set_trace()
    weight_tensor_int4 = bitblas_matmul.transform_weight(quant_w.T)

# with torch.no_grad():
#     out1 = ori_fc(dequant_input.half())

with torch.no_grad():

    time1=time.time()
    output_tensor = bitblas_matmul(input_tensor, weight_tensor_int4)
    time2=time.time()
    print(f"bitblas_matmul_time: {time2-time1}")

# print((out1==out2).sum(),out1.numel())
# print(out1)
# print(out2)

# assert(torch.allclose(out1, out2, atol=1e-3, rtol=1e-3))
LeiWang1999 commented 1 month ago

@brisker , for small shapes you should run at least 1k iterations and take the average runtime. Moreover, torch.cuda.synchronize() should be called before time2=time.time(), otherwise the measured time is indeterminate. You can also use bitblas_matmul.profile_latency() to get the kernel performance.
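For example, a minimal benchmarking sketch along those lines (warm up first, then average over many iterations, synchronizing before reading the clock so the GPU work has actually finished):

import time
import torch

def bench(fn, warmup=10, iters=1000):
    # warm up (also triggers any lazy compilation/caching)
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - start) / iters  # average seconds per call

# usage, with the tensors from the snippet above:
# avg_s = bench(lambda: bitblas_matmul(input_tensor, weight_tensor_int4))
# print(f"bitblas avg latency: {avg_s * 1e3:.4f} ms")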

brisker commented 1 month ago

@LeiWang1999 bitblas_matmul.profile_latency() gives me 0.025, but compared to 0.0003 that is still a big difference. I don't think this difference can be due to anything other than the W4A8 implementation itself.

Will fast_decoding=False cause such a tremendous difference for W4A8? Also, out_dtype has to be float32 (float16 causes an error), which may also have some influence. In other W4A8 pipelines, out_dtype is float16.

LeiWang1999 commented 1 month ago

@brisker the unit of the profile_latency API is milliseconds (ms).

brisker commented 1 month ago

@LeiWang1999

with torch.no_grad():
    time1=time.time()
    out1 = w4a8_qserve_linear.forward(input)
    torch.cuda.synchronize()
    time2=time.time()
    print(f"qserve_linear_time: {time2-time1}")

    time3=time.time()
    output_tensor = bitblas_matmul(quanti_input_tensor, weight_tensor_int4)
    torch.cuda.synchronize()
    time4=time.time()
    print(f"bitblas_matmul_time: {time4-time3}")
    print(f"bitblas_matmul_profile_latency--: {bitblas_matmul.profile_latency()}")

The code above gives me:

qserve_linear_time: 0.0004611015319824219
bitblas_matmul_time: 0.11247825622558594
bitblas_matmul_profile_latency--: 0.0241664
  1. (time4 - time3) being a lot larger than (time2 - time1) is still very weird to me, since torch.cuda.synchronize() is already added.

  2. Besides, why is (time4 - time3) also a lot larger than bitblas_matmul.profile_latency()? No matter how we measure the latency, (time4 - time3) is the number that matters when applying BitBLAS to LLM quantization to accelerate inference.

LeiWang1999 commented 1 month ago

@brisker , for benchmarking it is crucial to run the program multiple times to minimize the impact of timing errors; single runs can produce inaccurate results.
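As a complementary approach, CUDA events time the kernels on the GPU timeline and avoid counting most of the Python and launch overhead that wall-clock timing includes (a sketch, using the same tensors as in the snippet above):

import torch

def cuda_event_time_ms(fn, warmup=10, iters=1000):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # elapsed_time returns milliseconds

# usage:
# print(cuda_event_time_ms(lambda: bitblas_matmul(quanti_input_tensor, weight_tensor_int4)))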

brisker commented 1 month ago

@LeiWang1999 I tried multiple run counts, from 10 to 500, 5k, and 50k. Below 1k runs BitBLAS is consistently slower, but at 5k and 50k runs BitBLAS is faster, and bitblas_matmul.profile_latency() is also consistently lower.

But in other benchmark code, 10 or 100 runs are normally enough. Why are your W4A8 ops so sensitive to the number of runs?

LeiWang1999 commented 1 month ago

I don't quite follow why you point out that bitblas_matmul.profile_latency() is also consistently lower, because, as you mentioned, qserve_linear_time is 0.0004611015319824219 s (about 461 us), while BitBLAS is 0.0241664 ms (about 24 us).

LeiWang1999 commented 1 month ago

Moreover, in my experience, 10 or 100 runs are normally not enough when the kernel runtime is at the microsecond level.

brisker commented 1 month ago

To make it clear, I plan to test on real Llama-2-7B models to measure the real W4A8 speed-up. Thanks for your patient replies!

Besides, will fast_decoding=False be harmful for BitBLAS W4A8, since fast_decoding=True currently has bugs?

@LeiWang1999

LeiWang1999 commented 1 month ago

Hi @brisker . In our experience, profile_latency() measures the real speed-up accurately. However, I recommend using nsys or nvprof, or running multiple iterations, to track the actual kernel runtime, as these tools provide more precise and reliable performance metrics for real-world workloads than timing with time.time() alone.

I believe fast_decoding=False doesn't significantly impact int8xint4 for the shapes you're profiling. However, it's always better to enable fast_decoding for optimal performance.

LeiWang1999 commented 1 month ago

Hi @brisker , we now support INT8xINT4 fast decoding and have fixed the compile issues. If you have any further questions about this issue, feel free to follow up in this thread!
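For reference, once you are on a build that includes the fix, a config along these lines should exercise the fast-decoding path again (a sketch reusing the int8xint4 config from earlier in the thread, with fast_decoding=True as the only change):

import bitblas

matmul_config = bitblas.MatmulConfig(
    M=256,  # M dimension
    N=2048,  # N dimension
    K=1024,  # K dimension
    A_dtype="int8",  # activation A dtype
    W_dtype="int4",  # weight W dtype
    accum_dtype="int32",  # accumulation dtype
    out_dtype="float32",  # output dtype
    layout="nt",
    with_bias=False,
    fast_decoding=True,  # enable the LOP3-based decode path
)
matmul = bitblas.Matmul(config=matmul_config)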

brisker commented 1 month ago

@LeiWang1999 May I ask which part of the W4A8 pipeline you call "fast decoding"?

LeiWang1999 commented 1 month ago

@brisker Sure. On the Python side, fast_decoding enables a tensorization schedule that replaces the decode with LOP3 instructions instead of ordinary type-conversion instructions. https://github.com/microsoft/BitBLAS/blob/60f3e5dedf411361f877de1443b5a596e00d342a/bitblas/gpu/matmul_mma_dequantize.py#L1016

On the CUDA side, you can observe some device functions whose names start with decode_.

xysmlx commented 1 month ago

@LeiWang1999 We may need a fallback setting for when fast_decoding is not available. This can be implemented by maintaining a list of fast_decoding-supported data types and automatically setting fast_decoding=False when the current data type is not in the list. It would be preferable not to expose performance flags.
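A minimal sketch of that fallback idea (hypothetical names and an illustrative support list, not the actual BitBLAS implementation):

# hypothetical whitelist of (A_dtype, W_dtype) pairs known to support fast decoding
FAST_DECODING_SUPPORTED = {
    ("float16", "int4"),
    ("float16", "int2"),
    ("int8", "int4"),
}

def resolve_fast_decoding(a_dtype: str, w_dtype: str, requested: bool = True) -> bool:
    """Return the effective fast_decoding flag, silently falling back when unsupported."""
    if not requested:
        return False
    return (a_dtype, w_dtype) in FAST_DECODING_SUPPORTED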