Open · yisunlp opened 2 months ago
Hi, if you are using accelerate to distribute your model across multiple GPUs, you should add "LlamaDecoderLayer_KIVI" to "no_split_module_classes", like:
device_map = infer_auto_device_map(
    model, no_split_module_classes=["LlamaDecoderLayer_KIVI"], **map_kwargs)
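For reference, here is a minimal sketch of how this can fit into model loading with accelerate; the load_kivi_model helper and the max_memory values are placeholders for illustration, not code from this repo:

from accelerate import infer_auto_device_map, dispatch_model

model = load_kivi_model(...)  # placeholder: however you construct the KIVI LLaMA model

# keep each KIVI decoder layer on a single GPU so its tensors never straddle devices
device_map = infer_auto_device_map(
    model,
    max_memory={0: "20GiB", 1: "20GiB"},  # assumed per-GPU budget, adjust to your hardware
    no_split_module_classes=["LlamaDecoderLayer_KIVI"],
)
model = dispatch_model(model, device_map=device_map)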
And in my experience, the following change may also help:
# this is the original code located in KIVI/quant/new_pack.py:232
# _minmax_along_last_dim[grid](data, mn, mx,
#                              data.numel(), data.shape[0], num_groups, group_size,
#                              BLOCK_SIZE_N=BLOCK_SIZE_N, num_warps=8)

# modified code
with torch.cuda.device(data.device):
    _minmax_along_last_dim[grid](data, mn, mx,
                                 data.numel(), data.shape[0], num_groups, group_size,
                                 BLOCK_SIZE_N=BLOCK_SIZE_N, num_warps=8)

# some other code...

with torch.cuda.device(data.device):
    _pack_along_last_dim[grid](bit, data, code, data.shape[0],
                               data.shape[1], feat_per_int,
                               BLOCK_SIZE_N=BLOCK_SIZE_N,
                               num_warps=8)
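The reason the wrapper helps: Triton launches a kernel on the current CUDA device, so if the input tensor lives on cuda:1 while the current device is still cuda:0, the launch mixes devices and raises the "Expected all tensors to be on the same device" error. Here is a standalone illustration of the same pattern; this is not KIVI code, the kernel below is just a toy copy kernel:

import torch
import triton
import triton.language as tl

@triton.jit
def _copy_kernel(src_ptr, dst_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(dst_ptr + offs, tl.load(src_ptr + offs, mask=mask), mask=mask)

def copy_on_any_gpu(src: torch.Tensor) -> torch.Tensor:
    dst = torch.empty_like(src)
    grid = (triton.cdiv(src.numel(), 1024),)
    # without this context manager the kernel launches on the current device
    # (usually cuda:0), which fails when src actually lives on cuda:1
    with torch.cuda.device(src.device):
        _copy_kernel[grid](src, dst, src.numel(), BLOCK=1024)
    return dst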
I changed my code and got:
Could you please provide the original code for testing memory and multi-batch speed?
I am neither the paper author nor the repo owner... I am the one who opened issue #24 several months ago, and I have never encountered this error before. Good luck.
I solved the problem, thank you very much for your help.
I ran mem_spd_test.py and got the following error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! I did not make any changes except the model path. I manually changed the device and got the same error as in https://github.com/jy-yuan/KIVI/issues/24. Any suggestions?