hollowstrawberry / kohya-colab

Accessible Google Colab notebooks for Stable Diffusion Lora training, based on the work of kohya-ss and Linaqruf
GNU General Public License v3.0

returned non-zero exit status 1. #93

Closed TheRyukenOmega16 closed 6 months ago

TheRyukenOmega16 commented 6 months ago

Every time I try to start training a LoRA, this error appears:

CUDA backend failed to initialize: Found CUDA version 12010, but JAX was built against version 12020, which is newer. The copy of CUDA that is installed must be at least as new as the version against which JAX was built. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

Traceback (most recent call last)

/content/kohya-trainer/train_network.py:17 in <module>

     14 from accelerate.utils import set_seed
     15 from diffusers import DDPMScheduler
     16
   ❱ 17 import library.train_util as train_util
     18 from library.train_util import (
     19     DreamBoothDataset,
     20 )

/content/kohya-trainer/library/train_util.py:36 in <module>

     33 import torch
     34 from torch.nn.parallel import DistributedDataParallel as DDP
     35 from torch.optim import Optimizer
   ❱ 36 from torchvision import transforms
     37 from transformers import CLIPTokenizer
     38 import transformers
     39 import diffusers

/usr/local/lib/python3.10/dist-packages/torchvision/__init__.py:6 in <module>

      3 from modulefinder import Module
      4
      5 import torch
   ❱  6 from torchvision import _meta_registrations, datasets, io, models, ops, transforms, util
      7
      8 from .extension import _HAS_OPS
      9

/usr/local/lib/python3.10/dist-packages/torchvision/_meta_registrations.py:164 in <module>

     161
     162
     163 @torch._custom_ops.impl_abstract("torchvision::nms")
   ❱ 164 def meta_nms(dets, scores, iou_threshold):
     165     torch._check(dets.dim() == 2, lambda: f"boxes should be a 2d tensor, got {dets.dim()
     166     torch._check(dets.size(1) == 4, lambda: f"boxes should have 4 elements in dimension
     167     torch._check(scores.dim() == 1, lambda: f"scores should be a 1d tensor, got {scores.

/usr/local/lib/python3.10/dist-packages/torch/_custom_ops.py:253 in inner

     250     """
     251
     252     def inner(func):
   ❱ 253         custom_op = _find_custom_op(qualname, also_check_torch_library=True)
     254         custom_op.impl_abstract(_stacklevel=3)(func)
     255         return func
     256

/usr/local/lib/python3.10/dist-packages/torch/_custom_op/impl.py:1076 in _find_custom_op

     1073         raise RuntimeError(
     1074             f"Could not find custom op \"{qualname}\". Did you register it via "
     1075             f"the torch._custom_ops API?")
   ❱ 1076     overload = get_op(qualname)
     1077     result = custom_op_from_existing(overload)
     1078     return result
     1079

/usr/local/lib/python3.10/dist-packages/torch/_custom_op/impl.py:1062 in get_op

     1059         error_not_found()
     1060     opnamespace = getattr(torch.ops, ns)
     1061     if not hasattr(opnamespace, name):
   ❱ 1062         error_not_found()
     1063     packet = getattr(opnamespace, name)
     1064     if not hasattr(packet, 'default'):
     1065         error_not_found()

/usr/local/lib/python3.10/dist-packages/torch/_custom_op/impl.py:1052 in error_not_found

     1049
     1050 def get_op(qualname):
     1051     def error_not_found():
   ❱ 1052         raise ValueError(
     1053             f"Could not find the operator {qualname}. Please make sure you have "
     1054             f"already registered the operator and (if registered from C++) "
     1055             f"loaded it via torch.ops.load_library.")

ValueError: Could not find the operator torchvision::nms. Please make sure you have already registered the operator and (if registered from C++) loaded it via torch.ops.load_library.

Traceback (most recent call last)

/usr/local/bin/accelerate:8 in <module>

     5 from accelerate.commands.accelerate_cli import main
     6 if __name__ == '__main__':
     7     sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
   ❱ 8     sys.exit(main())
     9

/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py:45 in main

     42         exit(1)
     43
     44     # Run
   ❱ 45     args.func(args)
     46
     47
     48 if __name__ == "__main__":

/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py:1104 in launch_command

     1101     elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA
     1102         sagemaker_launcher(defaults, args)
     1103     else:
   ❱ 1104         simple_launcher(args)
     1105
     1106
     1107 def main():

/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py:567 in simple_launcher

     564     process = subprocess.Popen(cmd, env=current_env)
     565     process.wait()
     566     if process.returncode != 0:
   ❱ 567         raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
     568
     569
     570 def multi_gpu_launcher(args):

CalledProcessError: Command '['/usr/bin/python3', 'train_network.py', '--dataset_config=/content/drive/MyDrive/Loras/Nejikawa-Raimu/dataset_config.toml', '--config_file=/content/drive/MyDrive/Loras/Nejikawa-Raimu/training_config.toml']' returned non-zero exit status 1.

In addition, the notebook spends about 30 minutes on installation before training starts.

P.S. Sorry if my English is very bad.
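
For anyone reading the log above: the two integers in the JAX message use CUDA's version encoding (major * 1000 + minor * 10), so 12010 is CUDA 12.1 and 12020 is CUDA 12.2. The run itself stops at the `torchvision::nms` ValueError raised while importing torchvision, an error that is commonly reported when the installed torch and torchvision builds do not match; that cause is an assumption here, since the thread does not confirm it. The sketch below only gathers diagnostic information and does not fix anything. The helper names `decode_cuda_version` and `installed_versions` are hypothetical, not part of the notebook or of any library.

```python
from importlib import metadata


def decode_cuda_version(code: int) -> tuple[int, int]:
    """Split a CUDA version integer (major * 1000 + minor * 10) into (major, minor)."""
    return code // 1000, (code % 1000) // 10


def installed_versions(packages=("torch", "torchvision", "jax", "jaxlib")) -> dict:
    """Return the installed version string of each package, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The two numbers quoted in the JAX message from the log.
print("CUDA found:", decode_cuda_version(12010))
print("JAX built against:", decode_cuda_version(12020))

# Package versions in the current environment (assumed relevant to the error).
for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Running this in a notebook cell before training and posting the output alongside the traceback would make it easier to see whether the torch and torchvision versions changed between a working and a failing session.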

21x2-42 commented 6 months ago

Just encountered the same problem.

githubnoot commented 6 months ago

Same problem! I've been troubleshooting back and forth with no solution. I always assume it's user error, but that doesn't seem to be the case here.

junwoochoi2 commented 6 months ago

Now fixed. Thank you 😀

githubnoot commented 6 months ago

> Now fixed. Thank you 😀

I just tried again, but it's the same error. Any idea if you tried a new version or changed a setting?

junwoochoi2 commented 6 months ago

> > Now fixed. Thank you 😀
>
> I just tried again, but it's the same error. Any idea if you tried a new version or changed a setting?

You are right. The output is completely broken.