tloen / llama-int8

Quantized inference code for LLaMA models
GNU General Public License v3.0

13B - load is successful on T4, but forward pass fails #2

Open deep-diver opened 1 year ago

deep-diver commented 1 year ago

any clues?

I had 30 GB of RAM and added a ~26 GB swapfile (2 MB × 13000 blocks), created with the following command:

sudo dd if=/dev/zero of=/swapfile bs=2M count=13000 status=progress

Allocating transformer on host
Loading checkpoint 0
Loading checkpoint 1

Loaded in 2590.17 seconds with 13.19 GiB
cuBLAS API failed with status 15
A: torch.Size([72, 5120]), B: torch.Size([5120, 5120]), C: (72, 5120); (lda, ldb, ldc): (c_int(2304), c_int(163840), c_int(2304)); (m, n, k): (c_int(72), c_int(5120), c_int(5120))
error detected
Traceback (most recent call last):
  File "/home/jupyter/llama-int8/example.py", line 117, in <module>
    fire.Fire(main)
  File "/opt/conda/envs/pt/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/envs/pt/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/envs/pt/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/jupyter/llama-int8/example.py", line 107, in main
    results = generator.generate(
  File "/home/jupyter/llama-int8/llama/generation.py", line 42, in generate
    logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
  File "/opt/conda/envs/pt/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/jupyter/llama-int8/llama/model.py", line 281, in forward
    h = layer(h, start_pos, freqs_cis, mask)
  File "/opt/conda/envs/pt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jupyter/llama-int8/llama/model.py", line 221, in forward
    h = x + self.attention.forward(
  File "/home/jupyter/llama-int8/llama/model.py", line 142, in forward
    xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
  File "/opt/conda/envs/pt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/pt/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/opt/conda/envs/pt/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/opt/conda/envs/pt/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/opt/conda/envs/pt/lib/python3.9/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!
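
If I'm reading the cuBLAS headers right, status 15 maps to CUBLAS_STATUS_NOT_SUPPORTED, so it may be worth confirming what GPU / CUDA combination bitsandbytes actually sees and whether the same int8 matmul fails outside the model. Below is a minimal sketch, not anything from this repo: it assumes the same torch + bitsandbytes environment as in the traceback, and the shapes just mirror the failing wq call (A: (72, 5120), B: (5120, 5120)).

import torch
import bitsandbytes as bnb

# What the runtime sees; a T4 should report compute capability (7, 5).
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))
print(torch.version.cuda)

# Rebuild the failing matmul shape in isolation: a 5120x5120 int8 linear layer,
# quantized the same way bitsandbytes quantizes it (has_fp16_weights=False).
lin = bnb.nn.Linear8bitLt(5120, 5120, bias=False,
                          has_fp16_weights=False, threshold=6.0)
lin = lin.cuda()  # moving to GPU triggers the int8 quantization of the weight

# Same activation shape as in the error message: (72, 5120) in fp16.
x = torch.randn(72, 5120, dtype=torch.float16, device="cuda")
with torch.no_grad():
    out = lin(x)  # goes through bnb.matmul -> MatMul8bitLt -> F.igemmlt, like the traceback
print(out.shape)

If this standalone layer raises the same cublasLt exception, the problem is in the bitsandbytes/CUDA setup (or the card's int8 path) rather than in this repo's model code; if it runs, the issue is more likely in how the checkpoint weights end up wrapped.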