tloen / alpaca-lora

Instruct-tune LLaMA on consumer hardware
Apache License 2.0

cuBLAS API failed with status 15 - Error #174

Open rmivdc opened 1 year ago

rmivdc commented 1 year ago

Hi, during the finetune.py launch I'm encountering the error titled above. I'm using Fedora 36 with CUDA 12 and Python 3.10.10. Initialization seems to begin like so:

CUDA SETUP: CUDA runtime path found: /usr/local/cuda-12.0/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 120

and then later, after loading some files:

Loading cached split indices for dataset at /home/rmivdc/.cache/huggingface/datasets/json/default-fac87d4e05e14783/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-e521db28b6879419.arrow and /home/rmivdc/.cache/huggingface/datasets/json/default-fac87d4e05e14783/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-eb712e2459ca28b6.arrow
/home/rmivdc/.local/lib/python3.10/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
  warnings.warn(
  0%|          | 0/1170 [00:00<?, ?it/s]
cuBLAS API failed with status 15
A: torch.Size([2048, 4096]), B: torch.Size([4096, 4096]), C: (2048, 4096); (lda, ldb, ldc): (c_int(65536), c_int(131072), c_int(65536)); (m, n, k): (c_int(2048), c_int(4096), c_int(4096))

Am I using the wrong library versions? Thanks for your help!

loganlebanoff commented 1 year ago

I ran into this issue as well with torch==2.0. When I uninstalled it and reinstalled torch==1.13.1, that seemed to fix the issue.
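
When comparing environments, it can help to print which torch build is active and which CUDA runtime it was compiled against; a mismatch between the wheel's CUDA build and the system CUDA is the usual trigger for this error. A small sketch (the helper name is mine, not from this thread):

```python
# Report the active torch build and the CUDA version it was compiled for.
import importlib.util


def torch_build_info() -> str:
    """Return the installed torch version and its CUDA build, or a note if absent."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    return f"torch {torch.__version__} (built for CUDA {torch.version.cuda})"


print(torch_build_info())
```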

rmivdc commented 1 year ago

Thanks! That version fixed it. EDIT: at least for CPU runs; GPU runs still throw the error.

loganlebanoff commented 1 year ago

The error went away for me on GPU

rmivdc commented 1 year ago

> The error went away for me on GPU

May I know what CUDA version and NVIDIA driver version you are using, as well as your versions of the following pip packages (if not the latest)?

accelerate appdirs bitsandbytes black black[jupyter] datasets fire gradio

Thanks!
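
One way to answer that in a single command (package list taken from the question; importlib.metadata is standard library in Python 3.8+):

```python
# Print the installed version of each package the question asks about.
from importlib import metadata

PACKAGES = ["accelerate", "appdirs", "bitsandbytes", "black", "datasets", "fire", "gradio"]

for pkg in PACKAGES:
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```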

loganlebanoff commented 1 year ago

CUDA 11.7. Also, I used conda to install PyTorch with CUDA (conda install pytorch=1.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia).

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
$ nvidia-smi
Mon Mar 27 20:19:20 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB           On | 00000000:05:00.0 Off |                    0 |
| N/A   29C    P0               63W / 400W|   7429MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB           On | 00000000:06:00.0 Off |                    0 |
| N/A   26C    P0               63W / 400W|   7717MiB / 81920MiB |      0%      Default |
$ pip list
Package             Version     Editable project location
------------------- ----------- ------------------------------
accelerate          0.18.0
aiofiles            23.1.0
aiohttp             3.8.4
aiosignal           1.3.1
altair              4.2.2
anyio               3.6.2
appdirs             1.4.4
async-timeout       4.0.2
attrs               22.2.0
bitsandbytes        0.37.2
certifi             2022.12.7
charset-normalizer  3.1.0
click               8.1.3
contourpy           1.0.7
cycler              0.11.0
datasets            2.10.1
deepspeed           0.8.3
defusedxml          0.7.1
dill                0.3.6
entrypoints         0.4
fastapi             0.95.0
ffmpy               0.3.0
filelock            3.10.6
fire                0.5.0
flit_core           3.8.0
fonttools           4.39.2
frozenlist          1.3.3
fsspec              2023.3.0
Glances             3.3.1.1
gradio              3.23.0
h11                 0.14.0
hjson               3.1.0
httpcore            0.16.3
httptools           0.5.0
httpx               0.23.3
huggingface-hub     0.13.3
idna                3.4
importlib-resources 5.12.0
Jinja2              3.1.2
jmespath            1.0.1
jsonschema          4.17.3
kiwisolver          1.4.4
linkify-it-py       2.0.0
loralib             0.1.1
markdown-it-py      2.2.0
MarkupSafe          2.1.2
matplotlib          3.7.1
mdit-py-plugins     0.3.3
mdurl               0.1.2
multidict           6.0.4
multiprocess        0.70.14
ninja               1.11.1
numpy               1.24.2
openai              0.27.2
orjson              3.8.8
packaging           23.0
pandas              1.5.3
peft                0.3.0.dev0  /home/fsuser/peft
Pillow              9.4.0
pip                 23.0.1
psutil              5.9.4
py-cpuinfo          9.0.0
pyarrow             11.0.0
pydantic            1.10.7
pydub               0.25.1
pyparsing           3.0.9
pyrsistent          0.19.3
python-dateutil     2.8.2
python-dotenv       1.0.0
python-multipart    0.0.6
pytz                2023.2
PyYAML              6.0
regex               2023.3.23
requests            2.28.2
responses           0.18.0
rfc3986             1.5.0
semantic-version    2.10.0
sentencepiece       0.1.97
setuptools          65.6.3
six                 1.16.0
sniffio             1.3.0
starlette           0.26.1
termcolor           2.2.0
tokenizers          0.13.2
toolz               0.12.0
torch               1.13.1
tqdm                4.65.0
transformers        4.28.0.dev0 /home/fsuser/transformers_main
typing_extensions   4.4.0
uc-micro-py         1.0.1
ujson               5.7.0
urllib3             1.26.15
uvicorn             0.21.1
uvloop              0.17.0
watchfiles          0.18.1
websockets          10.4
wheel               0.38.4
xxhash              3.2.0
yarl                1.8.2
zipp                3.15.0

leehanchung commented 1 year ago

CUDA 12 is not compatible with PyTorch 2.0.

https://github.com/pytorch/pytorch/blob/master/RELEASE.md#release-compatibility-matrix

Following is the Release Compatibility Matrix for PyTorch releases:

PyTorch version | Python | Stable CUDA | Experimental CUDA
-- | -- | -- | --
2.0 | >=3.8, <=3.11 | CUDA 11.7, CUDNN 8.5.0.96 | CUDA 11.8, CUDNN 8.7.0.84
1.13 | >=3.7, <=3.10 | CUDA 11.6, CUDNN 8.3.2.44 | CUDA 11.7, CUDNN 8.5.0.96
1.12 | >=3.7, <=3.10 | CUDA 11.3, CUDNN 8.3.2.44 | CUDA 11.6, CUDNN 8.3.2.44

Also, Python 3.11 is not compatible either; the max version is 3.10.
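
The matrix above can be turned into a quick programmatic check; a sketch (CUDA versions abbreviated to major.minor pairs, and the helper is illustrative, not part of PyTorch):

```python
# Supported (stable + experimental) CUDA major.minor versions per PyTorch
# major.minor release, transcribed from the compatibility matrix above.
SUPPORTED_CUDA = {
    "2.0": {"11.7", "11.8"},
    "1.13": {"11.6", "11.7"},
    "1.12": {"11.3", "11.6"},
}


def is_supported(torch_version: str, cuda_version: str) -> bool:
    """True if this torch/CUDA major.minor pair appears in the matrix."""
    torch_mm = ".".join(torch_version.split(".")[:2])
    cuda_mm = ".".join(cuda_version.split(".")[:2])
    return cuda_mm in SUPPORTED_CUDA.get(torch_mm, set())


print(is_supported("1.13.1", "11.7"))  # True
print(is_supported("2.0.0", "12.0"))   # False: the mismatch in this issue
```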

mudomau commented 1 year ago

Getting the same issue here trying to run inference on the google t5-xl model.

Error:

cuBLAS API failed with status 15
A: torch.Size([1, 2048]), B: torch.Size([2048, 2048]), C: (1, 2048); (lda, ldb, ldc): (c_int(32), c_int(65536), c_int(32)); (m, n, k): (c_int(1), c_int(2048), c_int(2048))
...
 File "/home/mau/.conda/envs/test/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/home/mau/.conda/envs/test/lib/python3.9/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!

I've tried all the fixes proposed here but no luck.

Environment packages:

_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
accelerate 0.18.0 pypi_0 pypi
bitsandbytes 0.37.2 pypi_0 pypi
blas 1.0 mkl
brotlipy 0.7.0 py39h27cfd23_1003
bzip2 1.0.8 h7b6447c_0
ca-certificates 2023.01.10 h06a4308_0 anaconda
certifi 2022.12.7 py39h06a4308_0 anaconda
cffi 1.15.1 py39h5eee18b_3
charset-normalizer 2.0.4 pyhd3eb1b0_0
cryptography 39.0.1 py39h9ce1e76_0
cuda-cudart 11.7.99 0 nvidia
cuda-cupti 11.7.101 0 nvidia
cuda-libraries 11.7.1 0 nvidia
cuda-nvrtc 11.7.99 0 nvidia
cuda-nvtx 11.7.91 0 nvidia
cuda-runtime 11.7.1 0 nvidia
cudatoolkit 11.3.1 h2bc3f7f_2 anaconda
ffmpeg 4.3 hf484d3e_0 pytorch
filelock 3.10.7 pypi_0 pypi
flit-core 3.8.0 py39h06a4308_0
freetype 2.12.1 h4a9f257_0
giflib 5.2.1 h5eee18b_3
gmp 6.2.1 h295c915_3
gnutls 3.6.15 he1e5248_0
huggingface-hub 0.13.3 pypi_0 pypi
idna 3.4 py39h06a4308_0
intel-openmp 2021.4.0 h06a4308_3561
jpeg 9e h5eee18b_1
lame 3.100 h7b6447c_0
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h295c915_0
libcublas 11.10.3.66 0 nvidia
libcufft 10.7.2.124 h4fbf590_0 nvidia
libcufile 1.6.0.25 0 nvidia
libcurand 10.3.2.56 0 nvidia
libcusolver 11.4.0.1 0 nvidia
libcusparse 11.7.4.91 0 nvidia
libdeflate 1.17 h5eee18b_0
libffi 3.4.2 h6a678d5_6
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libiconv 1.16 h7f8727e_2
libidn2 2.3.2 h7f8727e_0
libnpp 11.7.4.75 0 nvidia
libnvjpeg 11.8.0.2 0 nvidia
libpng 1.6.39 h5eee18b_0
libstdcxx-ng 11.2.0 h1234567_1
libtasn1 4.16.0 h27cfd23_0
libtiff 4.5.0 h6a678d5_2
libunistring 0.9.10 h27cfd23_0
libwebp 1.2.4 h11a3e52_1
libwebp-base 1.2.4 h5eee18b_1
lz4-c 1.9.4 h6a678d5_0
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py39h7f8727e_0
mkl_fft 1.3.1 py39hd3c417c_0
mkl_random 1.2.2 py39h51133e4_0
ncurses 6.4 h6a678d5_0
nettle 3.7.3 hbbd107a_1
numpy 1.23.5 py39h14f4228_0
numpy-base 1.23.5 py39h31eccc5_0
openh264 2.1.1 h4ff587b_0
openssl 1.1.1t h7f8727e_0
packaging 23.0 pypi_0 pypi
pillow 9.4.0 py39h6a678d5_0
pip 23.0.1 py39h06a4308_0
psutil 5.9.4 pypi_0 pypi
pycparser 2.21 pyhd3eb1b0_0
pyopenssl 23.0.0 py39h06a4308_0
pysocks 1.7.1 py39h06a4308_0
python 3.9.16 h7a1cb2a_2
pytorch 1.13.1 py3.9_cuda11.7_cudnn8.5.0_0 pytorch
pytorch-cuda 11.7 h778d358_3 pytorch
pytorch-mutex 1.0 cuda pytorch
pyyaml 6.0 pypi_0 pypi
readline 8.2 h5eee18b_0
regex 2023.3.23 pypi_0 pypi
requests 2.28.1 py39h06a4308_1
sentencepiece 0.1.97 pypi_0 pypi
setuptools 65.6.3 py39h06a4308_0
six 1.16.0 pyhd3eb1b0_1
sqlite 3.41.1 h5eee18b_0
tk 8.6.12 h1ccaba5_0
tokenizers 0.13.2 pypi_0 pypi
torchaudio 0.13.1 py39_cu117 pytorch
torchvision 0.14.1 py39_cu117 pytorch
tqdm 4.65.0 pypi_0 pypi
transformers 4.28.0.dev0 pypi_0 pypi
typing_extensions 4.4.0 py39h06a4308_0
tzdata 2022g h04d1e81_0
urllib3 1.26.15 py39h06a4308_0
wheel 0.38.4 py39h06a4308_0
xz 5.2.10 h5eee18b_1
zlib 1.2.13 h5eee18b_0
zstd 1.5.4 hc292b87_0

rmivdc commented 1 year ago

@mudomau Do you have the same issue with "decapoda-research/llama-7b-hf" ?

I'm encountering another error now but the last Dockerfile install uploaded 3 days ago fixed that cuBLAS error for me.

samuelcardoso commented 1 year ago

same problem here.

trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
A: torch.Size([5120, 4096]), B: torch.Size([4096, 4096]), C: (5120, 4096); (lda, ldb, ldc): (c_int(163840), c_int(131072), c_int(163840)); (m, n, k): (c_int(5120), c_int(4096), c_int(4096))
cuBLAS API failed with status 15
error detected
$ nvidia-smi
Tue Apr 11 21:25:11 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:08:00.0  On |                  N/A |
|  0%   53C    P8    18W / 220W |   1020MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1292      G   /usr/lib/xorg/Xorg                460MiB |
|    0   N/A  N/A      1577      G   /usr/bin/gnome-shell              172MiB |
|    0   N/A  N/A      3884      G   ...RendererForSitePerProcess       86MiB |
|    0   N/A  N/A      5441      G   ...983706979455292193,131072      249MiB |
+-----------------------------------------------------------------------------+
$ /usr/local/cuda-11.6/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:03_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0
$ pip list
Package                  Version
------------------------ -------------
accelerate               0.18.0
aiofiles                 23.1.0
aiohttp                  3.8.4
aiosignal                1.3.1
altair                   4.2.2
anyio                    3.6.2
appdirs                  1.4.4
apturl                   0.5.2
asttokens                2.2.1
async-timeout            4.0.2
attrs                    22.2.0
backcall                 0.2.0
bitsandbytes             0.37.2
black                    23.3.0
blinker                  1.4
Brlapi                   0.8.3
certifi                  2020.6.20
chardet                  4.0.0
charset-normalizer       3.1.0
click                    8.0.3
cmake                    3.26.3
colorama                 0.4.4
command-not-found        0.3
contourpy                1.0.7
cryptography             3.4.8
cupshelpers              1.0
cycler                   0.11.0
datasets                 2.11.0
dbus-python              1.2.18
decorator                5.1.1
defer                    1.0.6
dill                     0.3.6
distro                   1.7.0
distro-info              1.1build1
entrypoints              0.4
executing                1.2.0
fastapi                  0.95.0
ffmpy                    0.3.0
filelock                 3.11.0
fire                     0.5.0
fonttools                4.39.3
frozenlist               1.3.3
fsspec                   2023.4.0
GPUtil                   1.4.0
gradio                   3.25.0
gradio_client            0.0.10
h11                      0.14.0
httpcore                 0.17.0
httplib2                 0.20.2
httpx                    0.24.0
huggingface-hub          0.13.4
idna                     3.3
importlib-metadata       4.6.4
ipython                  8.12.0
jedi                     0.18.2
jeepney                  0.7.1
Jinja2                   3.1.2
jsonschema               4.17.3
keyring                  23.5.0
kiwisolver               1.4.4
language-selector        0.1
launchpadlib             1.10.16
lazr.restfulclient       0.14.4
lazr.uri                 1.0.6
linkify-it-py            2.0.0
lit                      16.0.1
llvmlite                 0.39.1
loralib                  0.1.1
louis                    3.20.0
macaroonbakery           1.3.1
markdown-it-py           2.2.0
MarkupSafe               2.1.2
matplotlib               3.7.1
matplotlib-inline        0.1.6
mdit-py-plugins          0.3.3
mdurl                    0.1.2
more-itertools           8.10.0
mpmath                   1.3.0
multidict                6.0.4
multiprocess             0.70.14
mypy-extensions          1.0.0
netifaces                0.11.0
networkx                 3.1
numba                    0.56.4
numpy                    1.23.5
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-cupti-cu11   11.7.101
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.2.10.91
nvidia-cusolver-cu11     11.4.0.1
nvidia-cusparse-cu11     11.7.4.91
nvidia-nccl-cu11         2.14.3
nvidia-nvtx-cu11         11.7.91
oauthlib                 3.2.0
olefile                  0.46
orjson                   3.8.10
packaging                23.0
pandas                   2.0.0
parso                    0.8.3
pathspec                 0.11.1
peft                     0.3.0.dev0
pexpect                  4.8.0
pickleshare              0.7.5
Pillow                   9.0.1
pip                      22.0.2
platformdirs             3.2.0
prompt-toolkit           3.0.38
protobuf                 3.12.4
psutil                   5.9.4
ptyprocess               0.7.0
pure-eval                0.2.2
pyarrow                  11.0.0
pycairo                  1.20.1
pycups                   2.0.1
pydantic                 1.10.7
pydub                    0.25.1
Pygments                 2.15.0
PyGObject                3.42.1
PyJWT                    2.3.0
pymacaroons              0.13.0
PyNaCl                   1.5.0
pynvml                   11.5.0
pyparsing                2.4.7
pyRFC3339                1.1
pyrsistent               0.19.3
python-apt               2.4.0+ubuntu1
python-dateutil          2.8.2
python-debian            0.1.43ubuntu1
python-multipart         0.0.6
pytz                     2022.1
pyxdg                    0.27
PyYAML                   5.4.1
regex                    2023.3.23
reportlab                3.6.8
requests                 2.25.1
responses                0.18.0
rich                     13.3.3
screen-resolution-extra  0.0.0
SecretStorage            3.3.1
semantic-version         2.10.0
sentencepiece            0.1.97
setuptools               59.6.0
six                      1.16.0
sniffio                  1.3.0
stack-data               0.6.2
starlette                0.26.1
sympy                    1.11.1
systemd-python           234
termcolor                2.2.0
tokenize-rt              5.0.0
tokenizers               0.13.3
tomli                    2.0.1
toolz                    0.12.0
torch                    1.13.1+cu116
torchaudio               0.13.1+cu116
torchvision              0.14.1+cu116
tqdm                     4.65.0
traitlets                5.9.0
transformers             4.28.0.dev0
triton                   2.0.0
typing_extensions        4.5.0
tzdata                   2023.3
ubuntu-advantage-tools   8001
ubuntu-drivers-common    0.0.0
uc-micro-py              1.0.1
ufw                      0.36.1
unattended-upgrades      0.1
urllib3                  1.26.5
uvicorn                  0.21.1
wadllib                  1.3.6
wcwidth                  0.2.6
websockets               11.0.1
wheel                    0.37.1
xdg                      5
xkit                     0.0.0
xxhash                   3.2.0
yarl                     1.8.2
zipp                     1.0.0

arvindsun commented 1 year ago

I am running into the same issue as well on an H100:

torch 1.13.1, bitsandbytes==0.38.1, cuda 11.8, python 3.10, cublas 11.11.3.6


    result = super().forward(x)
  File "/home/arvind/.local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 320, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/arvind/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 500, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/arvind/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/home/arvind/.local/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1436, in igemmlt
    raise Exception('cublasLt ran into an error!')
> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

SVEEu commented 1 year ago

I hit the same issue when fine-tuning the 30B and 65B models, even on different clouds.

For the 65B model it occurs randomly, about 70% of the time; for the 30B model it occurs every time.

Malfaro43 commented 1 year ago

> I am running into the same issue as well on a H100: torch 1.13.1, bitsandbytes==0.38.1, cuda 11.8, python 3.10, cublas 11.11.3.6 [same traceback and nvcc output as above]

@arvindsun Have you fixed this? I'm also running into this issue when using an H100 on Lambda Labs.

daniel-furman commented 1 year ago

Getting the same error on an H100 on Lambda Labs

jonataslaw commented 1 year ago

Getting the same error on an H100 on Lambda Labs too

leehanchung commented 1 year ago

Getting the same error on an H100 on Lambda Labs too

Try running it without 8-bit mode since you are on an H100

jonataslaw commented 1 year ago

> Getting the same error on an H100 on Lambda Labs too
>
> Try running it without 8-bit mode since you are on an H100

I tried it.

Lambda's H100 instances have CUDA 11.8, but the PyTorch 2.0.1 there is compiled for cu117, which is not compatible. The bitsandbytes version also has a problem, and you need to rename the CUDA version you are using.

I also tried installing CUDA 12 to use the latest torch, but strangely the installation aborts without a failure message, so I gave up testing on the H100 after spending 3 hours trying to configure it. I'll try another RunPod instance; locally I could train for 3 epochs, but I need more compute to train for 10, and my RTX 4090 would take weeks.

zubair-ahmed-ai commented 1 year ago

Facing the same error on a Lambda Labs H100 instance trying to load Falcon-40B in 8-bit. What's the solution?

jonataslaw commented 1 year ago

Export these variables:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Install a compatible CUDA (11.7 doesn't support the H100):

sudo apt install cuda-nvcc-11-8 libcusparse-11-8 libcusparse-dev-11-8 libcublas-dev-11-8 libcublas-11-8 libcusolver-dev-11-8 libcusolver-11-8

Remove the old CUDA:

apt remove cuda-nvcc-11-7

Install a compatible PyTorch:

pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.0 --extra-index-url https://download.pytorch.org/whl/cu118
pip install pytorch-lightning==1.9.0

If you will use DeepSpeed for CPU offload (it makes training faster), you need:

pip install deepspeed==0.7.0

Edit these files (using vim, nano, or SFTP), changing the import of inf from torch._six to an import from math:

/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/utils.py
/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py
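
The replacement works because math.inf is the same IEEE positive infinity that the removed torch._six module re-exported; a minimal check:

```python
# math.inf is a drop-in replacement for the `inf` that the DeepSpeed files
# previously imported from torch._six (removed in newer torch releases).
from math import inf

print(inf > 1e308)           # behaves as positive infinity
print(float("inf") == inf)   # identical to the float built from the string
```
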

Thytu commented 1 year ago

> Facing the same error on a Lambda Labs H100 instance trying to load Falcon-40B in 8-bit. What's the solution?

Ended up moving back to an A100 😅

daniel-furman commented 1 year ago

Has anyone else tried and confirmed the efficacy of @jonataslaw's solution two comments above? Will test myself over the weekend.

daniel-furman commented 1 year ago

I was able to solve this error with the conda install approach found here: https://github.com/TimDettmers/bitsandbytes/issues/85

# jupyter setup
wget http://repo.continuum.io/archive/Anaconda3-2023.03-1-Linux-x86_64.sh
bash Anaconda3-2023.03-1-Linux-x86_64.sh
source ~/.bashrc

conda create --name cap
conda activate cap
conda install pip
conda install cudatoolkit
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

git clone https://github.com/timdettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=118 make cuda11x
python setup.py install

pip install scipy
python -m bitsandbytes
# should report a successful build
huawei-lin commented 1 year ago

I ran into this issue on an H100 GPU and fixed it by changing load_in_8bit=True to load_in_8bit=False on line 114 of finetune.py.
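
If you'd rather script that change than edit by hand, a sed sketch (run on a throwaway copy here; the quoted line is illustrative, not the exact contents of finetune.py):

```shell
# Create a stand-in for the relevant finetune.py line, then flip the flag.
printf 'model = LlamaForCausalLM.from_pretrained(base_model, load_in_8bit=True)\n' > /tmp/finetune_copy.py
sed -i 's/load_in_8bit=True/load_in_8bit=False/' /tmp/finetune_copy.py
cat /tmp/finetune_copy.py
```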

zubair-ahmed-ai commented 1 year ago

@daniel-furman

> I was able to solve this error with the conda install approach found here: TimDettmers/bitsandbytes#85 [same setup commands as above]

Sadly, it gave me the error below:

Downloading (…)fetensors.index.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36.2k/36.2k [00:00<00:00, 10.6MB/s]
Downloading (…)of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.96G/9.96G [03:00<00:00, 55.3MB/s]
Downloading (…)of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.86G/9.86G [02:57<00:00, 55.4MB/s]
Downloading (…)of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.86G/9.86G [02:57<00:00, 55.4MB/s]
Downloading (…)of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.36G/1.36G [00:24<00:00, 55.2MB/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [09:22<00:00, 140.63s/it]

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/ubuntu/miniconda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/home/ubuntu/miniconda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/home/ubuntu/miniconda/envs/starchat/lib/libcudart.so'), PosixPath('/home/ubuntu/miniconda/envs/starchat/lib/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: CUDA runtime path found: /home/ubuntu/miniconda/envs/starchat/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 9.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/ubuntu/miniconda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading checkpoint shards:   0%|                                                                                                                                                 | 0/4 [00:00<?, ?it/s]
Error named symbol not found at line 528 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.cu

Jacobsolawetz commented 1 year ago

Got this issue on an H100 on RunPod

HaishuoFang commented 1 year ago

Same, got this on an H100 with 8-bit. The H100 works with 16-bit.

jieWANGforwork commented 11 months ago

Got this error on an H100 using 8-bit LLaMA. Has anyone made it work on an H100?

huawei-lin commented 11 months ago

> Got this error on an H100 using 8-bit LLaMA. Has anyone made it work on an H100?

You can avoid using 8-bit; 4-bit and 16-bit are fine.