cuBLAS API failed with status 15 - Error #174

Open rmivdc opened 1 year ago

rmivdc commented 1 year ago

Hi, when launching the command I'm encountering the error in the title. I'm using Fedora 36 with CUDA 12 and Python 3.10.10. Initialization begins like this:

CUDA SETUP: CUDA runtime path found: /usr/local/cuda-12.0/lib64/
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 120

and then later after loading some files :

Loading cached split indices for dataset at /home/rmivdc/.cache/huggingface/datasets/json/default-fac87d4e05e14783/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-e521db28b6879419.arrow and /home/rmivdc/.cache/huggingface/datasets/json/default-fac87d4e05e14783/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-eb712e2459ca28b6.arrow
/home/rmivdc/.local/lib/python3.10/site-packages/transformers/ FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
  warnings.warn(
  0%| | 0/1170 [00:00<?, ?it/s]
cuBLAS API failed with status 15
A: torch.Size([2048, 4096]), B: torch.Size([4096, 4096]), C: (2048, 4096); (lda, ldb, ldc): (c_int(65536), c_int(131072), c_int(65536)); (m, n, k): (c_int(2048), c_int(4096), c_int(4096))

Am I using the wrong library versions? Thanks for your help.
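For anyone decoding the log above: the number in "cuBLAS API failed with status 15" is a value of cuBLAS's `cublasStatus_t` enum, and 15 is `CUBLAS_STATUS_NOT_SUPPORTED`. A small lookup sketch follows; the table mirrors the enum values in NVIDIA's cuBLAS documentation, and the helper name `describe_cublas_status` is made up for illustration.

```python
# Values of cublasStatus_t as listed in NVIDIA's cuBLAS documentation.
CUBLAS_STATUS = {
    0: "CUBLAS_STATUS_SUCCESS",
    1: "CUBLAS_STATUS_NOT_INITIALIZED",
    3: "CUBLAS_STATUS_ALLOC_FAILED",
    7: "CUBLAS_STATUS_INVALID_VALUE",
    8: "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
    15: "CUBLAS_STATUS_NOT_SUPPORTED",
    16: "CUBLAS_STATUS_LICENSE_ERROR",
}


def describe_cublas_status(code: int) -> str:
    """Hypothetical helper: map a numeric cuBLAS status to its enum name."""
    return CUBLAS_STATUS.get(code, f"unknown cuBLAS status {code}")


print(describe_cublas_status(15))  # CUBLAS_STATUS_NOT_SUPPORTED
```

The name only says that the requested operation is not supported in this configuration; it does not say which part of the setup is responsible.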

loganlebanoff commented 1 year ago

I ran into this issue as well with torch==2.0. When I uninstalled it and re-installed as torch==1.13.1, then it seemed to fix the issue.

rmivdc commented 1 year ago

Thanks! This version fixed it. EDIT: at least when running on CPU; running on GPU still throws that error.

loganlebanoff commented 1 year ago

The error went away for me on GPU

rmivdc commented 1 year ago

May I know which CUDA version and NVIDIA driver version you are using?

leehanchung commented 1 year ago

CUDA 12 is not compatible with PyTorch 2.0.

Following is the Release Compatibility Matrix for PyTorch releases:

PyTorch version | Python | Stable CUDA | Experimental CUDA
-- | -- | -- | --
2.0 | >=3.8, <=3.11 | CUDA 11.7, CUDNN | CUDA 11.8, CUDNN
1.13 | >=3.7, <=3.10 | CUDA 11.6, CUDNN | CUDA 11.7, CUDNN
1.12 | >=3.7, <=3.10 | CUDA 11.3, CUDNN | CUDA 11.6, CUDNN

Also, Python 3.11 is not compatible either; the max version is 3.10.
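As a rough way to check a setup against the matrix, here is a minimal Python sketch. It encodes only the three rows quoted in this comment; the names `COMPAT` and `cuda_support` are hypothetical, and a pair missing from the table only means it is not listed here, not that it is guaranteed to fail.

```python
# Rows copied from the compatibility matrix quoted in this comment.
COMPAT = {
    "2.0": {"stable": "11.7", "experimental": "11.8"},
    "1.13": {"stable": "11.6", "experimental": "11.7"},
    "1.12": {"stable": "11.3", "experimental": "11.6"},
}


def cuda_support(torch_version: str, cuda_version: str):
    """Return 'stable', 'experimental', or None for a (PyTorch, CUDA) pair."""
    row = COMPAT.get(torch_version)
    if row is None:
        return None  # PyTorch release not covered by the quoted rows
    for level in ("stable", "experimental"):
        if row[level] == cuda_version:
            return level
    return None


print(cuda_support("2.0", "11.7"))  # stable
print(cuda_support("2.0", "12.0"))  # None
```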

mudomau commented 1 year ago

Getting the same issue here trying to run inference on the google t5-xl model.


cuBLAS API failed with status 15
A: torch.Size([1, 2048]), B: torch.Size([2048, 2048]), C: (1, 2048); (lda, ldb, ldc): (c_int(32), c_int(65536), c_int(32)); (m, n, k): (c_int(1), c_int(2048), c_int(2048))
  File "/home/mau/.conda/envs/test/lib/python3.9/site-packages/bitsandbytes/autograd/", line 377, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/home/mau/.conda/envs/test/lib/python3.9/site-packages/bitsandbytes/", line 1410, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!

I've tried all the fixes proposed here but no luck.

Environment packages:

rmivdc commented 1 year ago

@mudomau Do you have the same issue with "decapoda-research/llama-7b-hf" ?

I'm encountering another error now but the last Dockerfile install uploaded 3 days ago fixed that cuBLAS error for me.

samuelcardoso commented 1 year ago

same problem here.

trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
A: torch.Size([5120, 4096]), B: torch.Size([4096, 4096]), C: (5120, 4096); (lda, ldb, ldc): (c_int(163840), c_int(131072), c_int(163840)); (m, n, k): (c_int(5120), c_int(4096), c_int(4096))
cuBLAS API failed with status 15
error detected
$ nvidia-smi
Tue Apr 11 21:25:11 2023       
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:08:00.0  On |                  N/A |
|  0%   53C    P8    18W / 220W |   1020MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|    0   N/A  N/A      1292      G   /usr/lib/xorg/Xorg                460MiB |
|    0   N/A  N/A      1577      G   /usr/bin/gnome-shell              172MiB |
|    0   N/A  N/A      3884      G   ...RendererForSitePerProcess       86MiB |
|    0   N/A  N/A      5441      G   ...983706979455292193,131072      249MiB |
$ /usr/local/cuda-11.6/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:03_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0

arvindsun commented 1 year ago

I am running into the same issue as well on an H100:

torch 1.13.1, bitsandbytes==0.38.1, cuda 11.8, python 3.10, cublas

    result = super().forward(x)
  File "/home/arvind/.local/lib/python3.10/site-packages/bitsandbytes/nn/", line 320, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/arvind/.local/lib/python3.10/site-packages/bitsandbytes/autograd/", line 500, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/arvind/.local/lib/python3.10/site-packages/bitsandbytes/autograd/", line 397, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/home/arvind/.local/lib/python3.10/site-packages/bitsandbytes/", line 1436, in igemmlt
    raise Exception('cublasLt ran into an error!')
> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

SVEEu commented 1 year ago

I get the same issue when fine-tuning 30B and 65B models, even on different cloud providers.

For the 65B model, it occurs randomly, about 70% of the time. For the 30B model, it occurs every time.

Malfaro43 commented 1 year ago

@arvindsun Have you fixed this? I'm also running into this issue when using an H100 on Lambda Labs.

daniel-furman commented 1 year ago

Getting the same error on an H100 on Lambda Labs

jonataslaw commented 1 year ago

Getting the same error on an H100 on Lambda Labs too

leehanchung commented 1 year ago

> Getting the same error on an H100 on Lambda Labs too

Try running it without 8-bit mode, since you are on an H100.

jonataslaw commented 1 year ago

> > Getting the same error on an H100 on Lambda Labs too
>
> Try to run it w/o 8-bit mode since you are on H100

I tried it.

Lambda's H100 instances have CUDA 11.8 installed, while the PyTorch 2.0.1 there is compiled against CUDA 11.7 (cu117), which is not compatible. The bitsandbytes version also has a problem, and you need to rename the CUDA version you are using.

I also tried installing CUDA 12 in order to use the latest version of torch, but strangely the installation is aborted, without fail, so I gave up on testing it on the H100; I had already spent 3 hours trying to configure it. I'll try it on another RunPod instance. Locally I could train successfully for 3 epochs, but I need more compute to train for 10; my RTX 4090 would take weeks.
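The mismatch described here (a torch wheel built for cu117 next to an 11.8 toolkit) can be spotted with a small string comparison. This is only a sketch: it assumes the wheel reports a version string of the form `2.0.1+cu117`, the helper names are hypothetical, and a tag mismatch is a hint to investigate rather than proof of an incompatibility.

```python
import re


def wheel_cuda_tag(torch_version: str):
    """Extract the CUDA tag from a version string such as '2.0.1+cu117'."""
    match = re.search(r"\+cu(\d+)$", torch_version)
    return match.group(1) if match else None


def toolkit_tag(cuda_release: str) -> str:
    """Turn a toolkit release such as '11.8' into the tag form '118'."""
    major, minor = cuda_release.split(".")[:2]
    return f"{major}{minor}"


def same_cuda(torch_version: str, cuda_release: str) -> bool:
    """True when the wheel's CUDA tag matches the installed toolkit release."""
    return wheel_cuda_tag(torch_version) == toolkit_tag(cuda_release)


print(same_cuda("2.0.1+cu117", "11.8"))  # False
print(same_cuda("2.0.0+cu118", "11.8"))  # True
```

On a real machine the first argument would come from `torch.__version__` and the second from the release reported by `nvcc --version`.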

zubair-ahmed-ai commented 1 year ago

Facing the same error on a Lambda Labs H100 instance trying to load Falcon-40B in 8-bit. What's the solution?

jonataslaw commented 1 year ago

Export these variables:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Install a compatible CUDA version (11.7 does not support the H100):

sudo apt install cuda-nvcc-11-8 libcusparse-11-8 libcusparse-dev-11-8 libcublas-dev-11-8 libcublas-11-8 libcusolver-dev-11-8 libcusolver-11-8

Remove old cuda:

sudo apt remove cuda-nvcc-11-7

Install the compatible pytorch:

pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.0 --extra-index-url
pip install pytorch-lightning==1.9.0

If you use DeepSpeed for CPU offload (it makes training faster), you also need:

pip install deepspeed==0.7.0

Edit these files (using Vim, nano, or SFTP), replacing the import of inf from torch._six with an import from math
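The import change described above can be sketched as follows. Which files need it is not listed here, so this only shows the pattern; the try/except fallback is an assumption added so the same line works whether or not `torch._six` is available in the installed torch.

```python
# Prefer the old location if it still exists, otherwise fall back to the
# standard library, which provides the same floating-point infinity.
try:
    from torch._six import inf
except ImportError:
    from math import inf

print(inf)  # inf
```

If you do not need to support older torch releases, a plain `from math import inf` is enough.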

Thytu commented 1 year ago

> Facing the same error on a Lambda Labs H100 instance trying to load Falcon-40B in 8 bit, what's the solution?

Ended up moving back to an A100 😅

daniel-furman commented 1 year ago

Has anyone else tried and confirmed the efficacy of @jonataslaw's solution two comments above? Will test myself over the weekend.

daniel-furman commented 1 year ago

I was able to solve this error with the conda install approach found here:

# jupyter setup
source ~/.bashrc

conda create --name cap
conda activate cap
conda install pip
conda install cudatoolkit
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

git clone
cd bitsandbytes
CUDA_VERSION=118 make cuda11x
python install

pip install scipy
python -m bitsandbytes
# should be a successful build

huawei-lin commented 1 year ago

I hit this issue on an H100 GPU and fixed it by changing load_in_8bit=True to load_in_8bit=False on line 114 of

zubair-ahmed-ai commented 1 year ago


> I was able to solve this error with the conda install approach found here: TimDettmers/bitsandbytes#85

Sadly it gave me the below error

Downloading (…)fetensors.index.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36.2k/36.2k [00:00<00:00, 10.6MB/s]
Downloading (…)of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.96G/9.96G [03:00<00:00, 55.3MB/s]
Downloading (…)of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.86G/9.86G [02:57<00:00, 55.4MB/s]
Downloading (…)of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.86G/9.86G [02:57<00:00, 55.4MB/s]
Downloading (…)of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.36G/1.36G [00:24<00:00, 55.2MB/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [09:22<00:00, 140.63s/it]

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to:
bin /home/ubuntu/miniconda/lib/python3.10/site-packages/bitsandbytes/
/home/ubuntu/miniconda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/ UserWarning: Found duplicate ['', '', ''] files: {PosixPath('/home/ubuntu/miniconda/envs/starchat/lib/'), PosixPath('/home/ubuntu/miniconda/envs/starchat/lib/')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['', '', ''] in the paths that we search based on your env.
CUDA SETUP: CUDA runtime path found: /home/ubuntu/miniconda/envs/starchat/lib/
CUDA SETUP: Highest compute capability among GPUs detected: 9.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/ubuntu/miniconda/lib/python3.10/site-packages/bitsandbytes/
Loading checkpoint shards:   0%|                                                                                                                                                 | 0/4 [00:00<?, ?it/s]Error named symbol not found at line 528 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/
Jacobsolawetz commented 1 year ago

Got this issue on H100 on runpod

HaishuoFang commented 1 year ago

Same here, I got this on an H100 with 8-bit. The H100 works with 16-bit.

jieWANGforwork commented 11 months ago

Got this error on an H100 using 8-bit Llama. Has anyone managed to make it work on an H100?

huawei-lin commented 11 months ago

> Got this error on an H100 using 8-bit Llama. Has anyone managed to make it work on an H100?

You can avoid using 8-bit; 4-bit and 16-bit are fine.
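A minimal sketch of that workaround, assuming the model is loaded through a call that accepts the `load_in_8bit` keyword mentioned earlier in this thread. The helper `load_kwargs` is hypothetical, and the cut-off at compute capability 9.x is an assumption based only on the H100 reports above (the logs show 9.0 for the H100 and 8.6 for the original poster's GPU).

```python
def load_kwargs(compute_capability: tuple, want_8bit: bool = True) -> dict:
    """Hypothetical helper: disable 8-bit loading on compute capability 9.x."""
    major, _minor = compute_capability
    # Assumption from this thread only: 8-bit fails on H100 (9.0),
    # while 4-bit and 16-bit are reported to work.
    use_8bit = want_8bit and major < 9
    return {"load_in_8bit": use_8bit}


print(load_kwargs((9, 0)))  # {'load_in_8bit': False}
print(load_kwargs((8, 6)))  # {'load_in_8bit': True}
```

In a real script the tuple could come from `torch.cuda.get_device_capability()`, and the resulting dict would be unpacked into the model-loading call.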