paperswithcode / galai

Model API for GALACTICA
Apache License 2.0

CUDA error: CUBLAS_STATUS_INVALID_VALUE when trying to compute the query projection #19

Status: Closed. FelipeMLopez closed this issue 1 year ago.

FelipeMLopez commented 1 year ago

Hi! First of all, thank you very much for this awesome project. I'm facing some issues when trying to execute this code:

import galai as gal

model = gal.load_model(name="base", num_gpus=1)
model.generate("Scaled dot product attention:\n\n\\[")

This is the error output:

Traceback (most recent call last):
  File "/home/fmlopez/launch_galactica.py", line 4, in <module>
    model.generate("Scaled dot product attention:\n\n\\[")
  File "/home/fmlopez/venv/lib/python3.9/site-packages/galai/model.py", line 136, in generate
    out = self.model.generate(
  File "/home/fmlopez/venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/fmlopez/venv/lib/python3.9/site-packages/transformers/generation_utils.py", line 1490, in generate
    return self.greedy_search(
  File "/home/fmlopez/venv/lib/python3.9/site-packages/transformers/generation_utils.py", line 2233, in greedy_search
    outputs = self(
  File "/home/fmlopez/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/fmlopez/venv/lib/python3.9/site-packages/galai/architecture.py", line 965, in forward
    outputs = self.model.decoder(
  File "/home/fmlopez/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/fmlopez/venv/lib/python3.9/site-packages/galai/architecture.py", line 726, in forward
    layer_outputs = decoder_layer(
  File "/home/fmlopez/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/fmlopez/venv/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/fmlopez/venv/lib/python3.9/site-packages/galai/architecture.py", line 328, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/fmlopez/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/fmlopez/venv/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/fmlopez/venv/lib/python3.9/site-packages/galai/architecture.py", line 178, in forward
    query_states = self.q_proj(hidden_states) * self.scaling
  File "/home/fmlopez/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/fmlopez/venv/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/fmlopez/venv/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

It seems to fail when computing the query projection (`self.q_proj(hidden_states)`, line 178 in architecture.py). Does anyone else have the same problem?

Thanks for the help!
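The traceback ends in `F.linear` (torch/nn/modules/linear.py, line 116), a float32 matrix multiply that PyTorch hands to cuBLAS. One way to tell a broken PyTorch/CUDA setup from a galai bug is to run that operation on its own. This is a minimal sketch under that assumption; the function name `cublas_linear_smoke_test` is hypothetical, and the tensor sizes are arbitrary illustration values, not GALACTICA's real dimensions.

```python
import importlib.util


def cublas_linear_smoke_test(batch=1, seq_len=8, features=2048):
    """Run a single float32 F.linear on the GPU, outside of galai.

    Hypothetical helper. Returns None if torch or a CUDA device is
    unavailable, True if the matmul succeeds, and False if PyTorch raises
    (for example a cuBLAS status error like the one in the traceback).
    Sizes are arbitrary illustration values, not the model's real ones.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    import torch.nn.functional as F

    if not torch.cuda.is_available():
        return None

    device = torch.device("cuda:0")
    hidden = torch.randn(batch, seq_len, features, device=device)
    weight = torch.randn(features, features, device=device)  # shaped like a q_proj weight
    bias = torch.randn(features, device=device)
    try:
        out = F.linear(hidden, weight, bias)  # float32 matmul, computed via cuBLAS
        torch.cuda.synchronize()  # surface any asynchronous CUDA errors here
    except RuntimeError as err:
        print(f"F.linear failed: {err}")
        return False
    return tuple(out.shape) == (batch, seq_len, features)


if __name__ == "__main__":
    print(cublas_linear_smoke_test())
```

If this also returns `False` with a CUBLAS_STATUS_INVALID_VALUE message, that would point at the PyTorch/CUDA installation rather than at galai itself.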

FelipeMLopez commented 1 year ago

It seems that I have found the problem. Apparently there was a mismatch between the CUDA toolkit installed on the system (reported by nvcc) and the CUDA version that Torch was built against. This is the CUDA runtime my Torch install pulled in:

Requirement already satisfied: nvidia-cuda-runtime-cu11==11.7.99 in ./venv/lib/python3.9/site-packages (from torch) (11.7.99)
Requirement already satisfied: nvidia-cuda-nvrtc-cu11==11.7.99 in ./venv/lib/python3.9/site-packages (from torch) (11.7.99)

And this is the version of nvcc installed:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Aug_15_21:14:11_PDT_2021
Cuda compilation tools, release 11.4, V11.4.120
Build cuda_11.4.r11.4/compiler.30300941_0
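Both version strings come straight from the outputs above, so the comparison can be scripted. A minimal sketch, assuming output in exactly the formats shown here: it extracts the major.minor CUDA version from `nvcc --version` and from the `nvidia-cuda-runtime-cu11` requirement line printed by pip. The helper names are hypothetical, and a differing version is only a hint, not proof of incompatibility.

```python
import re
from typing import Tuple

# Matches e.g. "Cuda compilation tools, release 11.4, V11.4.120"
NVCC_RELEASE = re.compile(r"release (\d+)\.(\d+)")
# Matches e.g. "nvidia-cuda-runtime-cu11==11.7.99"
PIP_CUDA_RUNTIME = re.compile(r"nvidia-cuda-runtime-cu\d+==(\d+)\.(\d+)")


def _major_minor(pattern: re.Pattern, text: str, source: str) -> Tuple[int, int]:
    """Return the first (major, minor) pair the pattern finds in text."""
    match = pattern.search(text)
    if match is None:
        raise ValueError(f"no CUDA version found in {source} output")
    return int(match.group(1)), int(match.group(2))


def cuda_versions(nvcc_output: str, pip_output: str) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return ((toolkit major, minor), (torch runtime major, minor))."""
    toolkit = _major_minor(NVCC_RELEASE, nvcc_output, "nvcc")
    runtime = _major_minor(PIP_CUDA_RUNTIME, pip_output, "pip")
    return toolkit, runtime


if __name__ == "__main__":
    nvcc_text = "Cuda compilation tools, release 11.4, V11.4.120"
    pip_text = (
        "Requirement already satisfied: nvidia-cuda-runtime-cu11==11.7.99 "
        "in ./venv/lib/python3.9/site-packages (from torch) (11.7.99)"
    )
    toolkit, runtime = cuda_versions(nvcc_text, pip_text)
    if toolkit != runtime:
        # For the outputs in this thread: nvcc 11.4 vs torch runtime 11.7
        print(f"Mismatch: nvcc reports CUDA {toolkit[0]}.{toolkit[1]}, "
              f"torch pulls in CUDA runtime {runtime[0]}.{runtime[1]}")
```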

I simply downgraded to the previous PyTorch release (1.12.1), built for CUDA 11.3:

pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113

It works fine now. Thank you!
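The `+cu113` suffix in the pinned versions is the local version tag that wheels on download.pytorch.org use to mark their CUDA build. This is a minimal sketch for confirming that the interpreter actually imports the reinstalled build. It assumes the `cuXYZ` convention in which the last digit is the minor version (`cu113` is 11.3, `cu102` is 10.2), and the helper names are hypothetical.

```python
import importlib.util
import re
from typing import Optional, Tuple

# Local version tag of CUDA wheels, e.g. "+cu113" -> (11, 3); assumed convention
LOCAL_CUDA_TAG = re.compile(r"\+cu(\d+)(\d)$")


def cuda_from_wheel_version(version: str) -> Optional[Tuple[int, int]]:
    """Decode '1.12.1+cu113' -> (11, 3); return None if there is no +cuXYZ tag."""
    match = LOCAL_CUDA_TAG.search(version)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def report_installed_torch() -> Optional[str]:
    """Describe the torch build the current interpreter imports, or None if absent."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    tag = cuda_from_wheel_version(str(torch.__version__))
    tag_text = f"{tag[0]}.{tag[1]}" if tag else "none"
    # torch.version.cuda is None for CPU-only builds
    return (f"torch {torch.__version__}: wheel tag CUDA {tag_text}, "
            f"built with CUDA {torch.version.cuda}")


if __name__ == "__main__":
    print(cuda_from_wheel_version("1.12.1+cu113"))  # (11, 3)
    print(report_installed_torch())
```

Running the report inside the same venv as galai helps rule out the case where a different environment still imports the earlier build.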