hollowstrawberry / kohya-colab

Accessible Google Colab notebooks for Stable Diffusion Lora training, based on the work of kohya-ss and Linaqruf
GNU General Public License v3.0
599 stars 87 forks source link

CUDA not working #69

Closed guy907223982 closed 9 months ago

guy907223982 commented 9 months ago

```
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 122
CUDA SETUP: TODO: compile library for specific version: libbitsandbytes_cuda122.so
CUDA SETUP: Defaulting to libbitsandbytes.so...
CUDA SETUP: CUDA detection failed. Either CUDA driver not installed, CUDA not installed, or you have multiple conflicting CUDA libraries!
CUDA SETUP: If you compiled from source, try again with make CUDA_VERSION=DETECTED_CUDA_VERSION, for example, make CUDA_VERSION=113.

Traceback (most recent call last):
  /content/kohya-trainer/train_network.py:873 in <module>
    train(args)
  /content/kohya-trainer/train_network.py:262 in train
    optimizer_name, optimizer_args, optimizer = train_util.get_optimizer(args, trainable…
  /content/kohya-trainer/library/train_util.py:2700 in get_optimizer
    import bitsandbytes as bnb
  /usr/local/lib/python3.10/dist-packages/bitsandbytes/__init__.py:6 in <module>
    from .autograd._functions import (
  /usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py:5 in <module>
    import bitsandbytes.functional as F
  /usr/local/lib/python3.10/dist-packages/bitsandbytes/functional.py:13 in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  /usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py:41 in <module>
    lib = CUDALibrary_Singleton.get_instance().lib
  /usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py:37 in get_instance
    cls._instance.initialize()
  /usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py:27 in initialize
    raise Exception('CUDA SETUP: Setup Failed!')
Exception: CUDA SETUP: Setup Failed!

Traceback (most recent call last):
  /usr/local/bin/accelerate:8 in <module>
    sys.exit(main())
  /usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py:45 in main
    args.func(args)
  /usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py:1104 in launch_command
    simple_launcher(args)
  /usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py:567 in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
CalledProcessError: Command '['/usr/bin/python3', 'train_network.py', '--dataset_config=/content/drive/MyDrive/Loras/ACB/dataset_config.toml', '--config_file=/content/drive/MyDrive/Loras/ACB/training_config.toml']' returned non-zero exit status 1.
```

hollowstrawberry commented 9 months ago

Perhaps you ran out of GPU time for the week?

guy907223982 commented 9 months ago

> Perhaps you ran out of GPU time for the week?

I didn't think about that

7wpanc24 commented 9 months ago

Same issue here. Definitely not a GPU time issue. Haven't used any in over a month.

hollowstrawberry commented 9 months ago

> Same issue here. Definitely not a GPU time issue. Haven't used any in over a month.

When did this start happening?

7wpanc24 commented 9 months ago

I've only just encountered it, but then I haven't used the notebook in weeks.

hollowstrawberry commented 9 months ago

I can confirm this happens every time starting today. Seems Colab updated their libraries again. Every time they do this it becomes trickier...

I'll take a look

7wpanc24 commented 9 months ago

Thank you!

3djedi commented 9 months ago

Same issue here today. Used yesterday with no issues.

```
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to:
https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form:
https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link

CUDA SETUP: Detected CUDA version 122
CUDA SETUP: CUDA detection failed. Either CUDA driver not installed, CUDA not installed, or you have multiple conflicting CUDA libraries!
Exception: CUDA SETUP: Setup Failed!
CalledProcessError: Command '['/usr/bin/python3', 'train_network.py', '--dataset_config=/content/drive/MyDrive/Loras/67Impala/dataset_config.toml', '--config_file=/content/drive/MyDrive/Loras/67Impala/training_config.toml']' returned non-zero exit status 1.
```

(The rest of the traceback is identical to the one in the issue description.)

Kuroseji commented 9 months ago

Just as a data point, this was working five hours ago. Best of luck fixing this.

dante-teo commented 9 months ago

> I can confirm this happens every time starting today. Seems Colab updated their libraries again. Every time they do this it becomes trickier...
>
> I'll take a look

Thanks for putting effort into this.

hollowstrawberry commented 9 months ago

It comes down to this:

```
CUDA backend failed to initialize: Found CUDA version 12010, but JAX was built against version 12020, which is newer. The copy of CUDA that is installed must be at least as new as the version against which JAX was built. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
```

I can't find a way to update CUDA or downgrade JAX properly.

If someone could help, we would all be thankful.
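The version integers in that message follow CUDA's usual encoding (major × 1000 + minor × 10), so 12010 is CUDA 12.1 and 12020 is CUDA 12.2. A minimal sketch of the comparison JAX appears to be making (the helper names are hypothetical, not part of JAX's API):

```python
def decode_cuda_version(v: int) -> str:
    """Decode an integer CUDA version such as 12010 into 'major.minor'."""
    major = v // 1000
    minor = (v % 1000) // 10
    return f"{major}.{minor}"

def runtime_is_new_enough(found: int, built_against: int) -> bool:
    """JAX requires the installed CUDA runtime to be at least as new
    as the version it was built against."""
    return found >= built_against

print(decode_cuda_version(12010))           # 12.1 (what Colab had installed)
print(decode_cuda_version(12020))           # 12.2 (what JAX was built against)
print(runtime_is_new_enough(12010, 12020))  # False -> hence the error above
```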

dante-teo commented 9 months ago

> It comes down to this:
>
> `CUDA backend failed to initialize: Found CUDA version 12010, but JAX was built against version 12020, which is newer. The copy of CUDA that is installed must be at least as new as the version against which JAX was built. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)`
>
> I can't find a way to update CUDA or downgrade JAX properly.
>
> If someone could help, we would all be thankful.

Couldn't we just downgrade JAX in the terminal, if we have Colab Pro?

hollowstrawberry commented 9 months ago

You don't need the Colab Pro terminal for that. You just need the right command.

dante-teo commented 9 months ago

> Just need the right command.

Yeah, I just noticed that; the command below is not working:

```
pip install -U "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
```
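When a reinstall like the one above doesn't help, a quick way to see which versions actually ended up in the runtime is to query the installed package metadata directly. A small diagnostic sketch (the package list is an assumption; adjust it to whatever the notebook installs):

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(package: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Hypothetical usage inside the Colab runtime:
for pkg in ("jax", "jaxlib", "bitsandbytes", "torch"):
    print(pkg, "->", installed_version(pkg))
```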
ddddfre commented 9 months ago

The program is still not working

PlumButa commented 9 months ago

I tried several notebooks for training LoRA on Colab, and they all had the same CUDA problem... If anyone could figure it out, it would be a really great thing. 😣

cecilshaw15 commented 9 months ago

A friend said it started to work after running this command:

```
!pip install --upgrade bitsandbytes
```

Haven't tried it myself, but I'll share anyway.

dante-teo commented 9 months ago

> A friend said it started to work after running this command:
>
> `!pip install --upgrade bitsandbytes`
>
> Haven't tried it myself, but I'll share anyway.

It works! Thanks a lot for sharing!

PlumButa commented 9 months ago

> A friend said it started to work after running this command: `!pip install --upgrade bitsandbytes` Haven't tried it myself, but I'll share anyway.
>
> It works! Thanks a lot for sharing!

Where do I put the command?

dante-teo commented 9 months ago

> A friend said it started to work after running this command: `!pip install --upgrade bitsandbytes` Haven't tried it myself, but I'll share anyway.
>
> It works! Thanks a lot for sharing!
>
> Where do I put the command?

I put it at the bottom of the install dependencies function, as attached:

(screenshot attached)
3djedi commented 9 months ago

I can confirm that the suggested addition works as described:

```
Installing collected packages: bitsandbytes
  Attempting uninstall: bitsandbytes
    Found existing installation: bitsandbytes 0.35.0
    Uninstalling bitsandbytes-0.35.0:
      Successfully uninstalled bitsandbytes-0.35.0
Successfully installed bitsandbytes-0.41.3.post2
```

✅ Installation finished in 148 seconds.

DEX-1101 commented 9 months ago

It actually worked (screenshot attached), thank you!

hollowstrawberry commented 9 months ago

> A friend said it started to work after running this command:
>
> `!pip install --upgrade bitsandbytes`
>
> Haven't tried it myself, but I'll share anyway.

Thank you lots. I have added the upgraded bitsandbytes version to the requirements. The trainer is working again; no changes are needed on your end as of right now.
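The fix amounts to pinning a newer bitsandbytes in the notebook's requirements. The exact pin is an assumption (the install log earlier in the thread shows 0.41.3.post2 being installed), but a plausible requirements line would be:

```
bitsandbytes>=0.41.3.post2
```

A floor pin like this lets future patch releases through while excluding the 0.35.0 build that could not load against Colab's CUDA 12.2 runtime.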

ieya114 commented 9 months ago

I'm still not getting it, what can I do?

PlumButa commented 9 months ago

> I'm still not getting it, what can I do?

No need to do anything anymore. Just use the new Lora training colab notebook link; hollowstrawberry has already updated it.