microsoft / BitBLAS

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.

How to measure and compare the time in QuickStart #89

Closed · ZiqingChang closed this issue 1 month ago

ZiqingChang commented 1 month ago

Hello,

I compared the runtime of the BitBLAS matmul against a normal torch.matmul in your QuickStart code, but there appears to be no speedup. Am I missing something?

import bitblas
import torch

import time

# enabling debug output

bitblas.set_log_level("Debug")
matmul_config = bitblas.MatmulConfig(
    M=1,  # M dimension
    N=1024,  # N dimension
    K=1024,  # K dimension
    A_dtype="float16",  # activation A dtype
    W_dtype="int2",  # weight W dtype
    accum_dtype="float16",  # accumulation dtype
    out_dtype="float16",  # output dtype
    layout="nt",  # matrix layout, "nt" indicates the layout of A is non-transpose and the layout of W is transpose
    with_bias=False,  # bias
    # configs for weight only quantization
    group_size=None,  # setting for grouped quantization
    with_scaling=False,  # setting for scaling factor
    with_zeros=False,  # setting for zeros
    zeros_mode=None,  # setting for how to calculate zeros
)

matmul = bitblas.Matmul(config=matmul_config)

# Create input matrices
input_tensor = torch.rand((1, 1024), dtype=torch.float16).cuda()
weight_tensor = torch.randint(-1, 2, (1024, 1024), dtype=torch.int8).cuda()

# Transform weight tensor to int2 data type
weight_tensor_int2 = matmul.transform_weight(weight_tensor)

start_time = time.time()
# Perform mixed-precision matrix multiplication
output_tensor = matmul(input_tensor, weight_tensor_int2)
bitblas_time = time.time() - start_time

start_time = time.time()
# Reference result using PyTorch matmul for comparison
ref_result = torch.matmul(input_tensor, weight_tensor.t().to(torch.float16))
ref_time = time.time() - start_time

print(f"BitBLAS Time: {bitblas_time * 1000:.3f} ms")
print(f"Ref Time: {ref_time * 1000:.3f} ms")

# Assert that the results are close within a specified tolerance; note that the int2 randint values are a little bigger than the float16 values, so we set the atol to 1.0
print("Ref output:", ref_result)
print("BitBLAS output:", output_tensor)
torch.testing.assert_close(output_tensor, ref_result, rtol=1e-2, atol=1e-0)

The output is:

BitBLAS Time: 64.350 ms
Ref Time: 61.494 ms

The measured times show no speedup. Am I missing something?

LeiWang1999 commented 1 month ago

Hi @ZiqingChang, when benchmarking in PyTorch, remember to include a synchronization step before reading the timer, since CUDA kernels are launched asynchronously.

start_time = time.time()
# Perform mixed-precision matrix multiplication
output_tensor = matmul(input_tensor, weight_tensor_int2)
torch.cuda.synchronize()
bitblas_time = time.time() - start_time

start_time = time.time()
# Reference result using PyTorch matmul for comparison
ref_result = torch.matmul(input_tensor, weight_tensor.t().to(torch.float16))
torch.cuda.synchronize()
ref_time = time.time() - start_time
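Because the launches are asynchronous, a host-side timer without synchronization mostly measures launch overhead rather than the kernel itself. As a rough sketch, CUDA events can also be used to time the GPU work directly, assuming the matmul, input_tensor, and weight_tensor_int2 objects from the QuickStart script above:

# Sketch: time the GPU work with CUDA events instead of host-side time.time().
# Assumes matmul, input_tensor, and weight_tensor_int2 from the script above.
start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

# Warm up so compilation / first-launch overhead is excluded.
for _ in range(10):
    _ = matmul(input_tensor, weight_tensor_int2)
torch.cuda.synchronize()

start_evt.record()
for _ in range(100):
    _ = matmul(input_tensor, weight_tensor_int2)
end_evt.record()
torch.cuda.synchronize()

# elapsed_time returns milliseconds between the two recorded events.
print(f"BitBLAS Time: {start_evt.elapsed_time(end_evt) / 100:.6f} ms")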
ZiqingChang commented 1 month ago

Hello @LeiWang1999, even after adding torch.cuda.synchronize(), there is still no speedup.

import bitblas
import torch

import time

# enabling debug output

bitblas.set_log_level("Debug")
matmul_config = bitblas.MatmulConfig(
    M=1,  # M dimension
    N=1024,  # N dimension
    K=1024,  # K dimension
    A_dtype="float16",  # activation A dtype
    W_dtype="int2",  # weight W dtype
    accum_dtype="float16",  # accumulation dtype
    out_dtype="float16",  # output dtype
    layout="nt",  # matrix layout, "nt" indicates the layout of A is non-transpose and the layout of W is transpose
    with_bias=False,  # bias
    # configs for weight only quantization
    group_size=None,  # setting for grouped quantization
    with_scaling=False,  # setting for scaling factor
    with_zeros=False,  # setting for zeros
    zeros_mode=None,  # setting for how to calculate zeros
)

matmul = bitblas.Matmul(config=matmul_config)

# Create input matrices
input_tensor = torch.rand((1, 1024), dtype=torch.float16).cuda()
weight_tensor = torch.randint(-1, 2, (1024, 1024), dtype=torch.int8).cuda()

# Transform weight tensor to int2 data type
weight_tensor_int2 = matmul.transform_weight(weight_tensor)

start_time = time.time()
# Perform mixed-precision matrix multiplication
output_tensor = matmul(input_tensor, weight_tensor_int2)
torch.cuda.synchronize()
bitblas_time = time.time() - start_time

weight_tensor = weight_tensor.t().to(torch.float16)

start_time = time.time()
# Reference result using PyTorch matmul for comparison
ref_result = torch.matmul(input_tensor, weight_tensor)
torch.cuda.synchronize()
ref_time = time.time() - start_time

print(f"BitBLAS Time: {bitblas_time * 1000:.3f} ms")
print(f"Ref Time: {ref_time * 1000:.3f} ms")

# Assert that the results are close within a specified tolerance; note that the int2 randint values are a little bigger than the float16 values, so we set the atol to 1.0
print("Ref output:", ref_result)
print("BitBLAS output:", output_tensor)
torch.testing.assert_close(output_tensor, ref_result, rtol=1e-2, atol=1e-0)

The output is:

BitBLAS Time: 62.542 ms
Ref Time: 47.760 ms

How can I see the speedup of BitBLAS' matmul over the normal torch.matmul?

LeiWang1999 commented 1 month ago

Hi @ZiqingChang, would you mind providing your device information? The result on my A100 is:

BitBLAS Time: 20.005 ms
Ref Time: 58.918 ms

Additionally, a 1x1024x1024 matmul should take only a few microseconds to execute, so a single time.time() measurement is dominated by launch and framework overhead; you should run both operations multiple times to get an accurate benchmark:

print(f"BitBLAS Time: {bitblas_time * 1000:.3f} ms")
print(f"Ref Time: {ref_time * 1000:.3f} ms")

# Assert that the results are close within a specified tolerance; note that the int2 randint values are a little bigger than the float16 values, so we set the atol to 1.0
print("Ref output:", ref_result)
print("BitBLAS output:", output_tensor)
torch.testing.assert_close(output_tensor, ref_result, rtol=1e-2, atol=1e-0)

# benchmark latency for bitblas
latency = matmul.profile_latency()
print(f"BitBLAS Time: {latency:.3f} ms")

def profile(model, *args):
    import numpy as np

    def get_runtime(num_repeats=1):
        # Average wall-clock time (ms) per call across num_repeats launches.
        tic = time.time()
        for _ in range(num_repeats):
            _ = model(*args)
        torch.cuda.synchronize()
        return (time.time() - tic) * 1000 / num_repeats

    with torch.no_grad():
        # Warm up for about one second so compilation and caching are excluded.
        st = time.time()
        while time.time() - st < 1.0:
            get_runtime()  # warmup
        warmup_runtime = get_runtime()
        # Choose the repeat count so the timed region spans roughly one second.
        num_repeats = max(1, int(1000 / warmup_runtime))
        times = get_runtime(num_repeats)
    return np.mean(times)

torch_time = profile(torch.matmul, input_tensor, weight_tensor.t().to(torch.float16))

print(f"Torch Time: {torch_time:.3f} ms")

The result on my device is:

BitBLAS Time: 23.031 ms
Ref Time: 56.993 ms
Ref output: tensor([[-14.1484,   6.2852, -14.3203,  ...,   9.4609, -14.7891,   7.5977]],
       device='cuda:0', dtype=torch.float16)
BitBLAS output: tensor([[-14.1562,   6.2734, -14.3125,  ...,   9.4375, -14.8047,   7.6094]],
       device='cuda:0', dtype=torch.float16)
BitBLAS Time: 0.004 ms
Torch Time: 0.007 ms
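As an alternative to the hand-rolled loop above, torch.utils.benchmark handles warmup, repetition, and CUDA synchronization; a minimal sketch, reusing matmul, input_tensor, weight_tensor_int2, and the transposed fp16 weight_tensor from the script above:

import torch.utils.benchmark as benchmark

# Sketch: torch.utils.benchmark synchronizes CUDA and repeats the statement.
# Assumes matmul, input_tensor, weight_tensor_int2, and the transposed fp16
# weight_tensor from the script above.
bitblas_timer = benchmark.Timer(
    stmt="matmul(input_tensor, weight_tensor_int2)",
    globals={"matmul": matmul, "input_tensor": input_tensor,
             "weight_tensor_int2": weight_tensor_int2},
)
torch_timer = benchmark.Timer(
    stmt="torch.matmul(input_tensor, weight_tensor)",
    globals={"torch": torch, "input_tensor": input_tensor,
             "weight_tensor": weight_tensor},
)
print(bitblas_timer.timeit(100))  # reports mean time per run
print(torch_timer.timeit(100))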
LeiWang1999 commented 1 month ago

We also provide benchmarking scripts to reproduce the figures under the benchmark directory.