huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

train_dreambooth_lora_sdxl.py & diffusers 0.27.0dev error during training: Signals.SIGKILL: 9 #7322

Closed · SylwiaNowakowska closed this issue 8 months ago

SylwiaNowakowska commented 8 months ago

Describe the bug

The latest train_dreambooth_lora_sdxl.py script with diffusers 0.27.0dev fails during training with Signals.SIGKILL: 9. The version of train_dreambooth_lora_sdxl.py from 28.02 (commit 7db935a) works with diffusers 0.26.3, but with that combination resuming from a checkpoint does not work. I have seen that the resume issue was later fixed in commit 5f150c4, with the script then requiring diffusers 0.27.0dev; I have tested that as well, and it fails with the same error: Signals.SIGKILL: 9.

Reproduction

!accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path='stabilityai/stable-diffusion-xl-base-1.0' \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --cache_dir='.../Project/cache_dir' \
  --dataset_name='.../Project/DATASET' \
  --image_column="image" \
  --caption_column="text" \
  --repeats=1 \
  --instance_prompt="In the style of MaGHY" \
  --validation_prompt="In the style of MaGHY, a MLO mammogram." \
  --num_validation_images=4 \
  --validation_epochs=1 \
  --output_dir='.../Project/OUTPUT/03_RUN' \
  --seed=42 \
  --resolution=1024 \
  --train_text_encoder \
  --train_batch_size=1 \
  --sample_batch_size=1 \
  --max_train_steps=200 \
  --checkpointing_steps=10 \
  --checkpoints_total_limit=100 \
  --gradient_accumulation_steps=5 \
  --gradient_checkpointing \
  --learning_rate=2e-04 \
  --text_encoder_lr=5e-6 \
  --lr_scheduler="constant" \
  --snr_gamma=5.0 \
  --lr_warmup_steps=500 \
  --lr_num_cycles=1 \
  --lr_power=1.0 \
  --dataloader_num_workers=0 \
  --optimizer="AdamW" \
  --adam_beta1=0.9 \
  --adam_beta2=0.999 \
  --adam_weight_decay=1e-04 \
  --adam_weight_decay_text_encoder=1e-03 \
  --adam_epsilon=1e-08 \
  --max_grad_norm=1.0 \
  --report_to=wandb \
  --mixed_precision="fp16" \
  --prior_generation_precision="fp16" \
  --local_rank=-1 \
  --use_8bit_adam \
  --rank=4
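
For completeness, the checkpoint-resume path mentioned in the bug description is exercised by rerunning the same command with the script's standard --resume_from_checkpoint flag ("latest" selects the newest checkpoint-<step> directory inside --output_dir). A minimal sketch of the delta, with the truncated paths kept as placeholders and every other flag unchanged:

# Sketch only: resume the run above from its newest checkpoint.
# All remaining arguments are identical to the command above.
!accelerate launch train_dreambooth_lora_sdxl.py \
  --resume_from_checkpoint="latest" \
  --output_dir='.../Project/OUTPUT/03_RUN'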

Logs

03/14/2024 11:39:36 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'dynamic_thresholding_ratio', 'variance_type', 'clip_sample_range', 'thresholding', 'rescale_betas_zero_snr'} was not found in config. Values will be initialized to default values.
{'latents_std', 'latents_mean'} was not found in config. Values will be initialized to default values.
{'reverse_transformer_layers_per_block', 'dropout', 'attention_type'} was not found in config. Values will be initialized to default values.
Resolving data files: 100%|██████████████| 8000/8000 [00:00<00:00, 88922.42it/s]
Traceback (most recent call last):
  File "/home/brayz/anaconda3/envs/Diff_Tuning_v2_env/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/brayz/anaconda3/envs/Diff_Tuning_v2_env/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/brayz/anaconda3/envs/Diff_Tuning_v2_env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
    simple_launcher(args)
  File "/home/brayz/anaconda3/envs/Diff_Tuning_v2_env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/brayz/anaconda3/envs/Diff_Tuning_v2_env/bin/python', 'train_dreambooth_lora_sdxl_0.27.0_dev0.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix', '--cache_dir=/media/brayz/brayz_storage/MAMMO_DIFFUSION_Project/cache_dir', '--dataset_name=/media/brayz/brayz_storage/MAMMO_DIFFUSION_Project/DATASET_EMBED_FOR_TRAINING_FLIPPED', '--image_column=image', '--caption_column=text', '--repeats=1', '--instance_prompt=In the style of MammoGHY', '--validation_prompt=In the style of MammoGHY, a MLO mammogram of a breast with a high tissue density.', '--num_validation_images=4', '--validation_epochs=1', '--output_dir=/media/brayz/brayz_storage/MAMMO_DIFFUSION_Project/OUTPUT_EMBED/03_RUN', '--seed=42', '--resolution=1024', '--train_text_encoder', '--train_batch_size=1', '--sample_batch_size=1', '--max_train_steps=200', '--checkpointing_steps=10', '--checkpoints_total_limit=100', '--gradient_accumulation_steps=5', '--gradient_checkpointing', '--learning_rate=2e-04', '--text_encoder_lr=5e-6', '--lr_scheduler=constant', '--snr_gamma=5.0', '--lr_warmup_steps=500', '--lr_num_cycles=1', '--lr_power=1.0', '--dataloader_num_workers=0', '--optimizer=AdamW', '--adam_beta1=0.9', '--adam_beta2=0.999', '--adam_weight_decay=1e-04', '--adam_weight_decay_text_encoder=1e-03', '--adam_epsilon=1e-08', '--max_grad_norm=1.0', '--report_to=wandb', '--mixed_precision=fp16', '--prior_generation_precision=fp16', '--local_rank=-1', '--use_8bit_adam', '--rank=4']' died with <Signals.SIGKILL: 9>.
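
A note on reading this traceback: "died with <Signals.SIGKILL: 9>" only says that the operating system killed the child training process; on Linux this is most often the kernel OOM killer reacting to exhausted system RAM (host memory, not GPU VRAM). One way to confirm, assuming a Linux host with the usual tools available, is to check the kernel log right after the crash:

# Look for OOM-killer activity around the time of the crash (dmesg may need root).
sudo dmesg | grep -iE 'killed process|out of memory' | tail -n 20
# Equivalent check via systemd's journal, if available:
journalctl -k --since "1 hour ago" | grep -i -e oom -e killed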

System Info

GPU: NVIDIA GeForce RTX 3090 (24 GB)
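
A fuller environment dump (diffusers, torch, transformers, accelerate versions plus platform) would help narrow this down; diffusers ships a small CLI for that. A sketch, assuming it is run inside the same conda environment as the failing script:

# Print library versions and platform info for the bug report.
diffusers-cli env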

Who can help?

@yiyixuxu @sayakpaul @DN6 I would appreciate your help.

sayakpaul commented 8 months ago

I cannot reproduce the issue on my end. I would recommend upgrading your PyTorch version as well as other libraries such as transformers and accelerate.

Also, the error logs aren't descriptive. We don't have any way to confirm which part of the code causes the issue.
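
For anyone following along, the suggested upgrades would look roughly like this (a sketch, not a guaranteed fix; the source install is only needed because the updated script expects diffusers 0.27.0.dev0):

# Upgrade the related libraries in the active environment.
pip install -U transformers accelerate
# Install the development version of diffusers from source.
pip install -U git+https://github.com/huggingface/diffusers
# PyTorch upgrades depend on the local CUDA setup; see https://pytorch.org/get-started/locally/ for the matching command.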

SylwiaNowakowska commented 8 months ago

Thank you so much for the quick answer. I have upgraded transformers to 4.38.2 and accelerate to 0.28.0. I did not upgrade torch because of CUDA compatibility issues. I still get the same error. If you have any further suggestions, I would be happy to test them.
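
A quick way to record which PyTorch/CUDA combination is actually in use (useful context for the compatibility concern above):

# Print the installed PyTorch version, the CUDA version it was built against, and whether CUDA is usable.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"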

sayakpaul commented 8 months ago

Unfortunate situation. I am unable to reproduce the error on my end :/

SylwiaNowakowska commented 8 months ago

Anyway, thanks for your support!