Closed ChenMnZ closed 3 weeks ago

When I instantiate a bitblas Linear, I have to wait several minutes. I want to know how I can instantiate it faster. My code is:
Hi @ChenMnZ, that's because BitBLAS fine-tunes many kernels on the first launch of a Linear with a given config. The compilation results of a bitblas Linear are saved under ~/.cache/bitblas, so the next time you instantiate a Linear with the same configuration, the tuning process is skipped.
If you want to disable tuning, you can instantiate the bitblas Linear with enable_tuning=False.
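For example, here is a minimal sketch of the cache effect (the 4096x4096 shape and the dtypes are just placeholder values): the first instantiation with a given config tunes and compiles the kernels, while the second one with the same config should load the saved results from ~/.cache/bitblas and return much faster.

import time
import bitblas

def timed_instantiation():
    # Build a bitblas Linear with a fixed config; the first call triggers
    # kernel tuning and compilation, and later calls with the same config
    # reuse the cached results under ~/.cache/bitblas.
    start = time.time()
    layer = bitblas.Linear(
        in_features=4096,
        out_features=4096,
        bias=False,
        A_dtype="float16",
        W_dtype="int2",
        accum_dtype="float16",
        out_dtype="float16",
    )
    print(f"instantiation took {time.time() - start:.1f}s")
    return layer

timed_instantiation()  # slow: kernels are tuned and compiled
timed_instantiation()  # should be fast: results come from the cache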
Thanks for your quick reply.
An additional question: in my testing, the forward pass of the aforementioned linear_layer hits a Segmentation fault. Is this a bug, or due to some environment mismatch?
The testing code is as follows:
import torch
import torch.nn as nn
import time
import bitblas
import pdb

# Pick the device, preferring the GPU
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--in_channels", type=int, default=4096)
parser.add_argument("--out_channels", type=int, default=4096)
args = parser.parse_args()
print(f'testing in_c {args.in_channels} out_c {args.out_channels}')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create the linear layer
# linear_layer = nn.Linear(args.in_channels, args.out_channels, bias=False).to(device=device, dtype=torch.float16)
linear_layer = bitblas.Linear(
    in_features=args.in_channels,
    out_features=args.out_channels,
    bias=False,
    A_dtype="float16",  # activation A dtype
    W_dtype="int2",  # weight W dtype
    accum_dtype="float16",  # accumulation dtype
    out_dtype="float16",  # output dtype
    # configs for weight-only quantization
    group_size=64,  # setting for grouped quantization
    with_scaling=False,  # setting for scaling factor
    with_zeros=False,  # setting for zeros
    # zeros_mode='original',  # setting for how to calculate zeros
    # zeros_mode='quantized',  # setting for how to calculate zeros
    # Target optimization var for dynamic symbolic.
    # For detailed information please check out docs/PythonAPI.md
    # By default, the optimization var is [1, 16, 32, 64, 128, 256, 512]
    opt_M=[1, 16, 32, 64, 128, 256, 512],
    enable_tuning=False,
)
# pdb.set_trace()

# Build a random tensor matching the input size
input_tensor = torch.randn(1, args.in_channels).to(device=device, dtype=torch.float16)

# Use CUDA events for precise timing
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

# Warm up so GPU start-up overhead does not skew the measurement
for i in range(100):
    print(i)
    _ = linear_layer(input_tensor)

# Start timing
start_event.record()

# Number of forward passes to run
num_trials = 1000
for i in range(num_trials):
    print(i)
    output = linear_layer(input_tensor)

# Stop timing
end_event.record()
torch.cuda.synchronize()  # wait for the event recording to finish

# Compute the average time
elapsed_time_ms = start_event.elapsed_time(end_event)  # milliseconds
average_time_us = elapsed_time_ms / num_trials * 1000  # convert to microseconds
# import pdb; pdb.set_trace()
print(f"in_c {args.in_channels} out_c {args.out_channels} average time per forward: {average_time_us:.0f} us")
The error does not occur in the first forward pass, but in the second one.
The output is:
testing in_c 128 out_c 128
BitBLAS Operator created.
0
1
Segmentation fault
It is strange that the first forward pass works while the second one fails.
You forgot to move the Linear layer to CUDA:
linear_layer = linear_layer.cuda()
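For reference, a minimal sketch of the fixed portion of the script above (the Linear config is unchanged; the only new step is the .cuda() call):

linear_layer = bitblas.Linear(
    in_features=args.in_channels,
    out_features=args.out_channels,
    bias=False,
    A_dtype="float16",
    W_dtype="int2",
    accum_dtype="float16",
    out_dtype="float16",
    group_size=64,
    opt_M=[1, 16, 32, 64, 128, 256, 512],
    enable_tuning=False,
)
linear_layer = linear_layer.cuda()  # the missing step: move the layer onto the GPU
input_tensor = torch.randn(1, args.in_channels).to(device=device, dtype=torch.float16)
output = linear_layer(input_tensor)  # weights and inputs are now on the same device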
closed :)