jy-yuan / KIVI

[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
https://arxiv.org/abs/2402.02750
MIT License

Multi GPUs #28

Open yisunlp opened 2 months ago

yisunlp commented 2 months ago

I ran mem_spd_test.py and got the following error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! I did not change anything except the model path. I also tried manually setting the device and got the same error as https://github.com/jy-yuan/KIVI/issues/24. Any suggestions?

[screenshot of the error attached]

xzwj1699 commented 2 months ago

Hi, if you are using accelerate to distribute your model across multiple GPUs, you should add "LlamaDecoderLayer_KIVI" to no_split_module_classes, like:

device_map = infer_auto_device_map(
    model, no_split_module_classes=["LlamaDecoderLayer_KIVI"], **map_kwargs)
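
For context, here is a minimal sketch of how that call fits into accelerate's dispatch path. The max_memory budgets are placeholders for your own GPUs, and model is assumed to be the KIVI Llama model you have already constructed:

import torch
from accelerate import infer_auto_device_map, dispatch_model

# `model` is the already-built KIVI model (e.g. the repo's Llama variant)
device_map = infer_auto_device_map(
    model,
    max_memory={0: "20GiB", 1: "20GiB"},                  # placeholder per-GPU budgets
    no_split_module_classes=["LlamaDecoderLayer_KIVI"],   # keep each KIVI decoder layer on one GPU
    dtype=torch.float16,
)
model = dispatch_model(model, device_map=device_map)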

And in my experience, the following change to the Triton kernel launches may also help:

# original code, located in KIVI/quant/new_pack.py:232
# _minmax_along_last_dim[grid](data, mn, mx,
#     data.numel(), data.shape[0], num_groups, group_size,
#     BLOCK_SIZE_N=BLOCK_SIZE_N, num_warps=8)

# modified code: pin the Triton launch to the device that owns `data`
with torch.cuda.device(data.device):
    _minmax_along_last_dim[grid](data, mn, mx,
        data.numel(), data.shape[0], num_groups, group_size,
        BLOCK_SIZE_N=BLOCK_SIZE_N, num_warps=8)

# some other code...

# same treatment for the packing kernel
with torch.cuda.device(data.device):
    _pack_along_last_dim[grid](bit, data, code, data.shape[0],
        data.shape[1], feat_per_int,
        BLOCK_SIZE_N=BLOCK_SIZE_N,
        num_warps=8)
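
For what it's worth, the reason this helps (as I understand it) is that Triton launches kernels on the current CUDA device, which defaults to cuda:0, so when the quantized KV tensors live on another GPU the launch and the data end up on different devices. A toy sketch of the pattern, using a purely illustrative copy kernel that is not from the repo:

import torch
import triton
import triton.language as tl

@triton.jit
def _copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(src_ptr + offsets, mask=mask)
    tl.store(dst_ptr + offsets, x, mask=mask)

src = torch.arange(1024, device="cuda:1", dtype=torch.float32)
dst = torch.empty_like(src)

# Triton launches on the *current* CUDA device (cuda:0 by default); the
# context manager makes the launch happen on the GPU that owns the tensors.
with torch.cuda.device(src.device):
    grid = (triton.cdiv(src.numel(), 256),)
    _copy_kernel[grid](src, dst, src.numel(), BLOCK=256)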
yisunlp commented 2 months ago

I made the change and got the following: [screenshot attached]

yisunlp commented 2 months ago

Could you please provide the original code for testing memory and multi-batch speed?

xzwj1699 commented 2 months ago

I am neither the paper author nor the repo owner... I am the one who opened issue #24 several months ago, and I have never encountered this error before. Good luck.

yisunlp commented 2 months ago

I solved the problem, thank you very much for your help.