hollowstrawberry / kohya-colab

Accessible Google Colab notebooks for Stable Diffusion Lora training, based on the work of kohya-ss and Linaqruf
GNU General Public License v3.0

Is the Lora Trainer 512 error happening again? CUDA backend failed to initialize: Found CUDA version 12010, #102

Closed: ridhoyp closed this issue 6 months ago

ridhoyp commented 6 months ago

Is it erroring again?


MyDrive/Loras/test_lora/dataset
📈 Found 95 images with 2 repeats, equaling 190 steps.
📉 Divide 190 steps by 2 batch size to get 95.0 steps per epoch.
🔮 There will be 10 epochs, for around 950 total training steps.
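For reference, the step counts the notebook prints above follow directly from the image count, repeats, batch size, and epoch count. A minimal sketch of that arithmetic (variable names are illustrative, not taken from the notebook):

```python
# Rough reproduction of the step math printed above (illustrative only).
num_images = 95
num_repeats = 2
batch_size = 2
num_epochs = 10

images_per_epoch = num_images * num_repeats        # 190 images seen per epoch
steps_per_epoch = images_per_epoch / batch_size    # 95.0 optimizer steps per epoch
total_steps = steps_per_epoch * num_epochs         # ~950 total training steps

print(images_per_epoch, steps_per_epoch, total_steps)  # 190 95.0 950.0
```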

✅ Dependencies already installed.

🔄 Model already downloaded.

📄 Config saved to /content/drive/MyDrive/Loras/test_lora/training_config.toml
📄 Dataset config saved to /content/drive/MyDrive/Loras/test_lora/dataset_config.toml

โญ Starting trainer...

CUDA backend failed to initialize: Found CUDA version 12010, but JAX was built against version 12020, which is newer. The copy of CUDA that is installed must be at least as new as the version against which JAX was built. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
Loading settings from /content/drive/MyDrive/Loras/test_lora/training_config.toml...
/content/drive/MyDrive/Loras/test_lora/training_config
prepare tokenizer
update token length: 225
Loading dataset config from /content/drive/MyDrive/Loras/testl_lora/dataset_config.toml
prepare images.
found directory /content/drive/MyDrive/Loras/test_lora/dataset contains 95 image files
190 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 2
  resolution: (1024, 1024)
  enable_bucket: True
  min_bucket_reso: 320
  max_bucket_reso: 1280
  bucket_reso_steps: 64
  bucket_no_upscale: False
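The CUDA message at the top is JAX warning that the CUDA installation it found (12010, i.e. 12.1) is older than the CUDA the installed JAX build targets (12020, i.e. 12.2). A quick way to see what the Colab runtime actually has (a sketch, not part of the notebook; assumes a GPU runtime with nvidia-smi available):

```python
# Quick version check in the Colab runtime (illustrative only).
import subprocess

import jax

print("jax:", jax.__version__)
# nvidia-smi reports the GPU driver and the CUDA version it supports.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
```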

[Subset 0 of Dataset 0]
  image_dir: "/content/drive/MyDrive/Loras/test_lora/dataset"
  image_count: 95
  num_repeats: 2
  shuffle_caption: True
  keep_tokens: 1
  caption_dropout_rate: 0.0
  caption_dropout_every_n_epoches: 0
  caption_tag_dropout_rate: 0.0
  color_aug: False
  flip_aug: True
  face_crop_aug_range: None
  random_crop: False
  token_warmup_min: 1,
  token_warmup_step: 0,
  is_reg: False
  class_tokens: None
  caption_extension: .txt

[Dataset 0]
loading image sizes.
100% 95/95 [00:00<00:00, 408.66it/s]
make buckets
number of images (including repeats) / 各bucketの画像枚数（繰り返し回数を含む）
bucket 0: resolution (704, 1280), count: 4
bucket 1: resolution (768, 1280), count: 2
bucket 2: resolution (832, 1216), count: 34
bucket 3: resolution (896, 1152), count: 18
bucket 4: resolution (1024, 1024), count: 100
bucket 5: resolution (1152, 896), count: 4
bucket 6: resolution (1216, 832), count: 24
bucket 7: resolution (1280, 704), count: 4
mean ar error (without repeats): 0.012814058038616015
preparing accelerator
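The bucket list above comes from aspect-ratio bucketing: roughly speaking, each image is resized into the bucket whose aspect ratio is closest to its own, so every batch contains images of a single resolution. A simplified sketch of that assignment (not the exact kohya-ss implementation; the bucket list is taken from the log above):

```python
# Simplified aspect-ratio bucket assignment (illustrative, not kohya-ss's exact code).
BUCKETS = [(704, 1280), (768, 1280), (832, 1216), (896, 1152),
           (1024, 1024), (1152, 896), (1216, 832), (1280, 704)]

def pick_bucket(width: int, height: int) -> tuple[int, int]:
    """Return the bucket resolution whose aspect ratio is closest to the image's."""
    ar = width / height
    return min(BUCKETS, key=lambda bucket: abs(bucket[0] / bucket[1] - ar))

print(pick_bucket(680, 1000))  # portrait image -> (832, 1216) with this bucket list
```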

/content/kohya-trainer/train_network.py:991 in <module>
    988   args = train_util.read_config_from_file(args, parser)
    989
    990   trainer = NetworkTrainer()
  ❱ 991   trainer.train(args)
    992

/content/kohya-trainer/train_network.py:205 in train
    202
    203       # acceleratorを準備する
    204       print("preparing accelerator")
  ❱ 205       accelerator = train_util.prepare_accelerator(args)
    206       is_main_process = accelerator.is_main_process
    207
    208       # mixed precisionに対応した型を用意しておき適宜castする

/content/kohya-trainer/library/train_util.py:3569 in prepare_accelerator
   3566       if args.wandb_api_key is not None:
   3567           wandb.login(key=args.wandb_api_key)
   3568
  ❱ 3569   accelerator = Accelerator(
   3570       gradient_accumulation_steps=args.gradient_accumulation_steps,
   3571       mixed_precision=args.mixed_precision,
   3572       log_with=log_with,

TypeError: Accelerator.__init__() got an unexpected keyword argument 'project_dir'

Traceback (most recent call last):

/usr/local/bin/accelerate:8 in <module>
    5   from accelerate.commands.accelerate_cli import main
    6   if __name__ == '__main__':
    7       sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
  ❱ 8       sys.exit(main())
    9

/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py:45 in main
    42           exit(1)
    43
    44   # Run
  ❱ 45   args.func(args)
    46
    47
    48   if __name__ == "__main__":

/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py:1104 in launch_command
  1101   elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA
  1102       sagemaker_launcher(defaults, args)
  1103   else:
 ❱ 1104       simple_launcher(args)
  1105
  1106
  1107   def main():

/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py:567 in simple_launcher
   564   process = subprocess.Popen(cmd, env=current_env)
   565   process.wait()
   566   if process.returncode != 0:
  ❱ 567       raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
   568
   569
   570   def multi_gpu_launcher(args):

CalledProcessError: Command '['/usr/bin/python3', 'train_network.py', '--dataset_config=/content/drive/MyDrive/Loras/test_lora/dataset_config.toml', '--config_file=/content/drive/MyDrive/Loras/test_lora/training_config.toml']' returned non-zero exit status 1.
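The actual failure is the TypeError above: train_network.py passes project_dir to Accelerator(), but the accelerate version installed in the runtime does not accept that argument. A minimal probe to confirm the mismatch (a sketch, assuming it is run in the same Colab runtime):

```python
# Minimal probe: does the installed accelerate accept the argument the script passes?
import inspect

import accelerate
from accelerate import Accelerator

print("accelerate version:", accelerate.__version__)
params = inspect.signature(Accelerator.__init__).parameters
print("accepts project_dir:", "project_dir" in params)  # False reproduces the TypeError above
```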

hollowstrawberry commented 6 months ago

I believe this happened while I was changing things, trying to find a fix. The current problem still appears to be #98.

ridhoyp commented 6 months ago

Alright, thank you for your hard work. :)