microsoft / BitBLAS

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.

Mismatch between Bitblas result and torch.matmul in QuickStart.md with batch size > 1 #56

Closed: MekkCyber closed this issue 1 month ago

MekkCyber commented 3 months ago

Hi everyone !

When I try to run the first script in QuickStart.md with a batch size of 1 in the input tensor, it works fine and I get the following result:

```
WARNING:bitblas.utils.target_detector:TVM target not found. Please set the TVM target environment variable using export 
TVM_TARGET=<target>, where <target> is one of the available targets can be found in the output of tools/get_available_targets.py.

Ref output: tensor([[1496., 1481., 1515.,  ..., 1547., 1532., 1532.]], device='cuda:0',
       dtype=torch.float16)
BitBLAS output: tensor([[1496., 1481., 1514.,  ..., 1546., 1532., 1532.]], device='cuda:0', dtype=torch.float16)
```

But once I use a batch size of 4, for example, I get too many mismatches:

```
AssertionError: Tensor-likes are not close!  
Mismatched elements: 3072 / 4096 (75.0%) 
Greatest absolute difference: 8648.0 at index (2, 857) (up to 1.0 allowed)  
Greatest relative difference: 5.96484375 at index (3, 393) (up to 0.01 allowed)
```

Is this intended behaviour? Thanks for your help!

I am using Python 3.10.14 and bitblas 0.0.1.dev12.

### Tasks
- [ ] Append Static Shape Check
LeiWang1999 commented 3 months ago

@MekkCyber The code works fine in my environment:

```python
import bitblas
import torch

# enabling debug output

bitblas.set_log_level("Debug")
matmul_config = bitblas.MatmulConfig(
    M=4,  # M dimension
    N=1024,  # N dimension
    K=1024,  # K dimension
    A_dtype="float16",  # activation A dtype
    W_dtype="int4",  # weight W dtype
    accum_dtype="float16",  # accumulation dtype
    out_dtype="float16",  # output dtype
    layout="nt",  # matrix layout, "nt" indicates the layout of A is non-transpose and the layout of W is transpose
    with_bias=False,  # bias
    # configs for weight only quantization
    group_size=None,  # setting for grouped quantization
    with_scaling=False,  # setting for scaling factor
    with_zeros=False,  # setting for zeros
    zeros_mode=None,  # setting for how to calculating zeros
)

matmul = bitblas.Matmul(config=matmul_config)

# Create input matrices
input_tensor = torch.rand((4, 1024), dtype=torch.float16).cuda()
weight_tensor = torch.randint(0, 7, (1024, 1024), dtype=torch.int8).cuda()

# Transform weight tensor to int4 data type
weight_tensor_int4 = matmul.transform_weight(weight_tensor)

# Perform mixed-precision matrix multiplication
output_tensor = matmul(input_tensor, weight_tensor_int4)

# Reference result using PyTorch matmul for comparison
ref_result = torch.matmul(input_tensor, weight_tensor.t().to(torch.float16))
# Assert that the results are close within a specified tolerance, note that the int4 randint value is a little bigger than the float16 value, so we set the atol to 1.0
print("Ref output:", ref_result)
print("BitBLAS output:", output_tensor)
torch.testing.assert_close(output_tensor, ref_result, rtol=1e-2, atol=1e-0)
```

It seems you may have forgotten to set the input tensor to (4, 1024) and instead kept it at (1, 1024), given that 75% of the elements mismatch.

We should consider adding a static shape check in future releases.
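
Until such a check lands, a minimal sketch of a manual guard the caller could add (the `expected_m`/`expected_k` names are illustrative, not BitBLAS API; they just mirror the M and K passed to `MatmulConfig` above):

```python
# Hypothetical guard against the shape mismatch above: expected_m/expected_k
# mirror the static M and K passed to MatmulConfig (illustrative only).
expected_m, expected_k = 4, 1024
assert input_tensor.shape == (expected_m, expected_k), (
    f"input shape {tuple(input_tensor.shape)} does not match "
    f"the static M={expected_m}, K={expected_k} in MatmulConfig"
)
```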

MekkCyber commented 3 months ago

Yeah, sorry, I didn't set M to 4 in matmul_config! Thanks for your help @LeiWang1999

w32zhong commented 2 months ago

@LeiWang1999 I would like to ask a follow-up question: if we are running language models with a dynamic batch/sequence size, how should we configure the BitBLAS matmul in those cases?

LeiWang1999 commented 2 months ago

@w32zhong Set M to a list, for example [1, 16, 32, 64], and BitBLAS will generate a dynamic kernel for each M, dispatching shapes like 31 to the kernel with M=32.
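
As a concrete sketch, reusing the QuickStart config from earlier in this thread (the only change is that M is now a list of candidate batch sizes):

```python
import bitblas
import torch

# Same operator as above, but M is a list of candidate batch sizes;
# BitBLAS generates a kernel per candidate and dispatches at runtime.
matmul_config = bitblas.MatmulConfig(
    M=[1, 16, 32, 64],  # dynamic M candidates
    N=1024,
    K=1024,
    A_dtype="float16",
    W_dtype="int4",
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",
    with_bias=False,
)
matmul = bitblas.Matmul(config=matmul_config)

weight_tensor = torch.randint(0, 7, (1024, 1024), dtype=torch.int8).cuda()
weight_tensor_int4 = matmul.transform_weight(weight_tensor)

# A batch of 31 rows should be dispatched to the M=32 kernel.
input_tensor = torch.rand((31, 1024), dtype=torch.float16).cuda()
output_tensor = matmul(input_tensor, weight_tensor_int4)
```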

w32zhong commented 2 months ago

> @w32zhong Set M to a list, for example [1, 16, 32, 64], and BitBLAS will generate a dynamic kernel for each M, dispatching shapes like 31 to the kernel with M=32.

Thank you so much. I actually just figured it out. It looks like dynamic shapes will also match batch sizes not specified in M; I guess the M values passed in as a tuple get specifically optimized after hardware fine-tuning?

LeiWang1999 commented 2 months ago

Yes, and you may also want to check out the generated CUDA source under ~/.cache/bitblas/nvidia. @w32zhong
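
For example, a quick standard-library sketch to see what has been generated there (the path is taken from the comment above; nothing BitBLAS-specific is assumed):

```python
import os

# Walk the BitBLAS kernel cache and print the generated source files.
cache_dir = os.path.expanduser("~/.cache/bitblas/nvidia")
for root, _dirs, files in os.walk(cache_dir):
    for name in files:
        print(os.path.join(root, name))
```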

brisker commented 1 month ago

@LeiWang1999 Hi, I tried this:

> @w32zhong Set M to a list, for example [1, 16, 32, 64], and BitBLAS will generate a dynamic kernel for each M, dispatching shapes like 31 to the kernel with M=32.
>
> Thank you so much. I actually just figured it out. It looks like dynamic shapes will also match batch sizes not specified in M; I guess the M values passed in as a tuple get specifically optimized after hardware fine-tuning?

@LeiWang1999 If I want to use BitBLAS to do batched matmul in LLMs, meaning (B, M, N) * (N, K), how should I set the matmul config?

LeiWang1999 commented 1 month ago

@brisker Hi, this is a matmul expression rather than a batch matmul operator (batch matmul is [B, M, K] x [B, K, N]); you can view the tensor from (B, M, K) into (B*M, K) and then apply the matmul kernel.
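
A minimal sketch of that view trick, assuming a `matmul` built with a dynamic-M config like the one sketched above (so that B * M is covered by one of the candidate M values) and the transformed `weight_tensor_int4` from the QuickStart example:

```python
import torch

# Illustrative shapes: (B, M, K) activations against a single (N, K) weight, "nt" layout.
B, M, K, N = 4, 8, 1024, 1024
activations = torch.rand((B, M, K), dtype=torch.float16).cuda()

# Fold the batch dimension into M, run the 2-D matmul kernel, then unfold the result.
flat_input = activations.view(B * M, K)                # (32, 1024), handled by the M=32 kernel
flat_output = matmul(flat_input, weight_tensor_int4)   # (32, 1024)
output = flat_output.view(B, M, N)                     # back to (B, M, N)
```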