kohya-ss / sd-scripts

Apache License 2.0

SIGKILL 9 when trying to train SDXL with sd-scripts #1065

Closed: Reaper176 closed this 9 months ago

Reaper176 commented 9 months ago

While attempting to train SDXL with sd-scripts, I get the error below.

If any other information is needed, please ask; I'm not great at knowing what is important.

Traceback (most recent call last):
  File "/home/user/anaconda3/envs/sd-scripts/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/user/anaconda3/envs/sd-scripts/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/user/anaconda3/envs/sd-scripts/lib/python3.10/site-packages/accelerate/commands/launch.py", line 986, in launch_command
    simple_launcher(args)
  File "/home/user/anaconda3/envs/sd-scripts/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/user/anaconda3/envs/sd-scripts/bin/python', 'sdxl_train_network.py', '--logging_dir=logs', '--log_prefix=daedream-net-alpha-4-net-dim-8-2500steps', '--network_module=networks.lora', '--max_data_loader_n_workers=1', '--persistent_data_loader_workers', '--caption_extension=.txt', '--shuffle_caption', '--keep_tokens=0', '--max_token_length=225', '--prior_loss_weight=1', '--mixed_precision=fp16', '--save_precision=fp16', '--xformers', '--cache_latents', '--save_model_as=safetensors', '--train_data_dir=/home/user/kohya/LoRA-Datasets/daedream/images/', '--output_dir=/home/user/kohya/LoRA-Datasets/daedream/lora_tests/daedream-net-alpha-4-net-dim-8-2500steps_ver-a1.0', '--reg_data_dir=/home/user/kohya/LoRA-Datasets/daedream/reg/', '--pretrained_model_name_or_path=/home/user/stable-diffusion-webui/models/Stable-diffusion/sd_xl_base_1.0.safetensors', '--output_name=daedream-net-alpha-4-net-dim-8-2500steps_ver-a1.0_', '--learning_rate=.05', '--unet_lr=.05', '--text_encoder_lr=.05', '--max_train_steps=2500', '--save_every_n_steps=500', '--resolution=1024', '--enable_bucket', '--min_bucket_reso=836', '--max_bucket_reso=1254', '--train_batch_size=1', '--network_dim=8', '--network_alpha=4', '--optimizer_type=Prodigy', '--lr_scheduler=cosine_with_restarts', '--noise_offset=0.0005', '--seed=0', '--clip_skip=1', '--sample_every_n_steps=500', '--sample_prompts=/home/user/kohya/LoRA-Datasets/daedream/test_prompts.txt', '--sample_sampler=k_euler_a', '--gradient_accumulation_steps=1', '--min_snr_gamma=5']' died with <Signals.SIGKILL: 9>.

Here is the full terminal output:

 (sd-scripts) user@comp:~/kohya/sd-scripts$ bash scripts-bash.sh

default paths
- base home dir: /home/user/kohya/LoRA-Datasets/daedream/
- image set dir: /home/user/kohya/LoRA-Datasets/daedream/images/

prepare tokenizers
update token length: 225
Using DreamBooth method.
prepare images.
found directory /home/user/kohya/LoRA-Datasets/daedream/images/1_daedream contains 37 image files
37 train images with repeating.
0 reg images.
no regularization images / no regularization images were found
[Dataset 0]
  batch_size: 1
  resolution: (512, 512)
  enable_bucket: True
  min_bucket_reso: 418
  max_bucket_reso: 627
  bucket_reso_steps: 64
  bucket_no_upscale: False

  [Subset 0 of Dataset 0]
    image_dir: "/home/user/kohya/LoRA-Datasets/daedream/images/1_daedream"
    image_count: 37
    num_repeats: 1
    shuffle_caption: True
    keep_tokens: 0
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    caption_prefix: None
    caption_suffix: None
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: daedream
    caption_extension: .txt

[Dataset 0]
loading image sizes.
100% 37/37 [00:00<00:00, 4805.07it/s]
make buckets
number of images (including repeats) / number of images per bucket (including repeats)
bucket 0: resolution (418, 627), count: 1
bucket 1: resolution (448, 610), count: 7
bucket 2: resolution (482, 576), count: 6
bucket 3: resolution (512, 512), count: 3
bucket 4: resolution (512, 546), count: 7
bucket 5: resolution (546, 512), count: 5
bucket 6: resolution (576, 482), count: 2
bucket 7: resolution (610, 448), count: 2
bucket 8: resolution (627, 418), count: 4
mean ar error (without repeats): 0.04748498032883879
clip_skip will be unexpected / clip_skip does not work in SDXL training
preparing accelerator
loading model for process 0/1
load StableDiffusion checkpoint: /home/user/stable-diffusion-webui/models/Stable-diffusion/sd_xl_base_1.0.safetensors
building U-Net
loading U-Net from checkpoint
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/sd-scripts/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/user/anaconda3/envs/sd-scripts/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/user/anaconda3/envs/sd-scripts/lib/python3.10/site-packages/accelerate/commands/launch.py", line 986, in launch_command
    simple_launcher(args)
  File "/home/user/anaconda3/envs/sd-scripts/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/user/anaconda3/envs/sd-scripts/bin/python', 'sdxl_train_network.py', '--logging_dir=logs', '--log_prefix=daedream-net-alpha-4-net-dim-8-2500steps', '--network_module=networks.lora', '--max_data_loader_n_workers=1', '--persistent_data_loader_workers', '--caption_extension=.txt', '--shuffle_caption', '--keep_tokens=0', '--max_token_length=225', '--prior_loss_weight=1', '--mixed_precision=fp16', '--save_precision=fp16', '--xformers', '--cache_latents', '--save_model_as=safetensors', '--train_data_dir=/home/user/kohya/LoRA-Datasets/daedream/images/', '--output_dir=/home/user/kohya/LoRA-Datasets/daedream/lora_tests/daedream-net-alpha-4-net-dim-8-2500steps_ver-a1.0', '--reg_data_dir=/home/user/kohya/LoRA-Datasets/daedream/reg/', '--pretrained_model_name_or_path=/home/user/stable-diffusion-webui/models/Stable-diffusion/sd_xl_base_1.0.safetensors', '--output_name=daedream-net-alpha-4-net-dim-8-2500steps_ver-a1.0_', '--learning_rate=.05', '--unet_lr=.05', '--text_encoder_lr=.05', '--max_train_steps=2500', '--save_every_n_steps=500', '--resolution=512', '--enable_bucket', '--min_bucket_reso=418', '--max_bucket_reso=627', '--train_batch_size=1', '--network_dim=8', '--network_alpha=4', '--optimizer_type=Prodigy', '--lr_scheduler=cosine_with_restarts', '--noise_offset=0.0005', '--seed=0', '--clip_skip=1', '--sample_every_n_steps=500', '--sample_prompts=/home/user/kohya/LoRA-Datasets/daedream/test_prompts.txt', '--sample_sampler=k_euler_a', '--gradient_accumulation_steps=1', '--min_snr_gamma=5']' died with <Signals.SIGKILL: 9>.

This is on a 3090 Ti, and other training runs have worked.

kohya-ss commented 9 months ago

The process seems to be killed while loading the U-Net. Loading it requires a lot of memory, and SIGKILL: 9 tends to happen when there is not enough. Please terminate other processes to free the main memory. If you have enough VRAM, the --lowram option may help by using VRAM during loading.
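You can confirm that the kernel's OOM killer was responsible by checking the kernel log after a failed run (a quick sketch; the exact log wording varies by distribution and kernel version):

  # Look for OOM-killer activity after the crash; a matching entry
  # names the killed process and how much memory it was using.
  sudo dmesg | grep -iE 'out of memory|killed process'

  # The same query on systemd-based systems:
  sudo journalctl -k | grep -iE 'out of memory|killed process'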

Reaper176 commented 9 months ago

12GB of RAM, 24GB VRAM, and there are no other processes running except the OS itself (and its processes). I have done as you stated and the error continues. Is 12 and 24 not enough?

kohya-ss commented 9 months ago

24GB VRAM is enough, but I think 12GB RAM would not be sufficient even with the --lowram option. Is it possible to expand the swap space?
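For reference, a swap file can be added on most Linux systems like this (a sketch assuming a 16 GB size, which should comfortably cover the SDXL checkpoint load; adjust the size to taste):

  sudo fallocate -l 16G /swapfile   # allocate the file (use dd if fallocate is unavailable)
  sudo chmod 600 /swapfile          # restrict access to root
  sudo mkswap /swapfile             # format it as swap
  sudo swapon /swapfile             # enable it for the current session
  # To keep it across reboots, add this line to /etc/fstab:
  # /swapfile none swap sw 0 0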

Reaper176 commented 9 months ago

Done and done. Thanks very much for the help. <3