microsoft / BitBLAS

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
MIT License

[Question] Why is it so slow to instantiate a bitblas linear layer? #40

Closed ChenMnZ closed 3 weeks ago

ChenMnZ commented 1 month ago

When I instantiate a bitblas linear layer, I have to wait several minutes.

I want to know how I can instantiate it faster.

My code is:

linear_layer = bitblas.Linear(
    in_features=128,
    out_features=128,
    bias=False,
    A_dtype="float16",  # activation A dtype
    W_dtype="int2",  # weight W dtype
    accum_dtype="float16",  # accumulation dtype
    out_dtype="float16",  # output dtype
    # configs for weight only quantization
    group_size=64,  # setting for grouped quantization
    with_scaling=False,  # setting for scaling factor
    with_zeros=False,  # setting for zeros
    opt_M=[1, 16, 32, 64, 128, 256, 512],
)
LeiWang1999 commented 1 month ago

Hi @ChenMnZ, that's because BitBLAS fine-tunes many kernels the first time you launch a Linear with a given config. The compilation results of a BitBLAS Linear are saved under ~/.cache/bitblas, so the next time you instantiate a Linear with the same configuration, the tuning process will be skipped.

And if you want to disable the tuning process, you can instantiate the BitBLAS Linear with enable_tuning=False.
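For reference, a minimal sketch reusing the configuration from the snippet above, but with tuning disabled at construction time:

import bitblas

linear_layer = bitblas.Linear(
    in_features=128,
    out_features=128,
    bias=False,
    A_dtype="float16",
    W_dtype="int2",
    accum_dtype="float16",
    out_dtype="float16",
    group_size=64,
    with_scaling=False,
    with_zeros=False,
    opt_M=[1, 16, 32, 64, 128, 256, 512],
    enable_tuning=False,  # skip kernel fine-tuning; cached results live under ~/.cache/bitblas
)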

ChenMnZ commented 1 month ago

Thanks for your quick answer.

A follow-up question: the forward pass of the aforementioned linear_layer hits a segmentation fault in my testing.

Is this a bug, or due to some environment mismatch?

ChenMnZ commented 1 month ago

The testing code is as follows:

import torch
import torch.nn as nn
import time
import bitblas
import pdb

# Select the device, preferring the GPU
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--in_channels", type=int, default=4096)
parser.add_argument("--out_channels", type=int, default=4096)
args = parser.parse_args()

print(f'testing in_c {args.in_channels} out_c {args.out_channels}')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create the linear layer
# linear_layer = nn.Linear(args.in_channels, args.out_channels,bias=False).to(device=device, dtype=torch.float16)
linear_layer = bitblas.Linear(
    in_features=args.in_channels,
    out_features=args.out_channels,
    bias=False,
    A_dtype="float16",  # activation A dtype
    W_dtype="int2",  # weight W dtype
    accum_dtype="float16",  # accumulation dtype
    out_dtype="float16",  # output dtype
    # configs for weight only quantization
    group_size=64,  # setting for grouped quantization
    with_scaling=False,  # setting for scaling factor
    with_zeros=False,  # setting for zeros
    # zeros_mode='original',  # setting for how to calculating zeros
    # zeros_mode='quantized',  # setting for how to calculating zeros
    # Target optimization var for dynamic symbolic.
    # For detailed information please checkout docs/PythonAPI.md
    # By default, the optimization var is [1, 16, 32, 64, 128, 256, 512]
    opt_M=[1, 16, 32, 64, 128, 256, 512],
    enable_tuning=False,
)

# pdb.set_trace()

# Create a random tensor matching the input size
input_tensor = torch.randn(1, args.in_channels).to(device=device, dtype=torch.float16)

# Use CUDA events for precise timing
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

# Warm up to keep the first-use GPU startup overhead out of the measurement
for i in range(100):
    print(i)
    _ = linear_layer(input_tensor)

# Start timing
start_event.record()

# Number of forward passes to run
num_trials = 1000
for i in range(num_trials):
    print(i)
    output = linear_layer(input_tensor)

# Stop timing
end_event.record()
torch.cuda.synchronize()  # wait for the events to be recorded

# Compute the average time
elapsed_time_ms = start_event.elapsed_time(end_event)  # milliseconds
average_time_us = elapsed_time_ms / num_trials * 1000  # convert to microseconds

# import pdb;pdb.set_trace()
print(f"in_c {args.in_channels} out_c {args.out_channels} 平均每次计算时间: {average_time_us:.0f} 微秒")

The error does not occur on the first forward pass, but on the second one.

The output is:

testing in_c 128 out_c 128
BitBLAS Operator created.
0
1
Segmentation fault

It is strange that the first call works while the second one fails.

LeiWang1999 commented 3 weeks ago

You forgot to move the Linear Layer to CUDA.

linear_layer = linear_layer.cuda()
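For completeness, a minimal sketch of the fix applied to the earlier test script (reusing the linear_layer and args from that snippet):

# Move the BitBLAS layer (and its packed weights) to the GPU before calling forward.
linear_layer = linear_layer.cuda()

input_tensor = torch.randn(1, args.in_channels, device="cuda", dtype=torch.float16)
output = linear_layer(input_tensor)  # forward now runs entirely on CUDA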
LeiWang1999 commented 3 weeks ago

closed :)