kohya-ss / sd-scripts


During training, loss=nan and a broken LoRA is generated #201

acncagua commented 1 year ago

Training a LoRA with the following parameters results in loss=nan, and the resulting LoRA file is corrupt. Is there anything I can do to fix this? xformers and the other options are set as recommended.

```
accelerate launch --num_cpu_threads_per_process 1 train_network.py ^
  --pretrained_model_name_or_path=J:\stable-diffusion-webui\models\Stable-diffusion\zmodels_0_marge_source\NAIbasil.safetensors ^
  --train_data_dir=J:\sd-scripts\training --output_dir=J:\sd-scripts\output --reg_data_dir=J:\sd-scripts\seisoku ^
  --resolution=512,512 --train_batch_size=6 --unet_lr=5e-5 --text_encoder_lr=5e-3 ^
  --max_train_epochs=10 --save_every_n_epochs=1 --save_model_as=safetensors ^
  --clip_skip=2 --seed=42 --color_aug --min_bucket_reso=320 --max_bucket_reso=1024 ^
  --network_module=networks.lora --lr_scheduler=cosine_with_restarts --lr_warmup_steps=500 ^
  --keep_tokens=2 --shuffle_caption --network_dim=128 --network_alpha=64 --enable_bucket ^
  --mixed_precision=fp16 --xformers --use_8bit_adam --lr_scheduler_num_cycles=4 ^
  --caption_extension=.txt --persistent_data_loader_workers --bucket_no_upscale --caption_dropout_rate=0.05
```

[Screenshot attachment: 2023-02-17 (2)]
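
Not a fix, but a quick way to confirm the NaNs actually ended up in the saved file is to scan its tensors directly. A minimal sketch, assuming the run wrote a file to --output_dir (the filename below is a placeholder; substitute whichever epoch file you got):

```python
# Minimal sketch: check a saved LoRA for NaN/Inf weights.
# The path/filename is hypothetical; point it at a file from --output_dir.
import torch
from safetensors.torch import load_file

state_dict = load_file(r"J:\sd-scripts\output\last.safetensors")

# Collect every tensor that contains at least one NaN or Inf value
bad = [k for k, t in state_dict.items() if not torch.isfinite(t).all()]
print(f"{len(bad)} of {len(state_dict)} tensors contain NaN/Inf")
for key in bad[:10]:
    print("  ", key)
```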

abhiishekpal commented 1 year ago

@acncagua Did you face the issue even with a much lower learning rate?

FlyHighest commented 1 year ago

I solved this issue by disabling xformers during training.
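
For context on why the combination of fp16 and a particular attention backend can matter here: fp16 tops out around 65504, so a single overflowing activation becomes inf, and inf turns into NaN in the very next subtraction or normalization, which then poisons the loss. A minimal illustration of the failure mode (plain PyTorch, nothing specific to this repo):

```python
import torch

# fp16 cannot represent values above ~65504; larger ones overflow to inf
x = torch.tensor([70000.0], dtype=torch.float16)
print(x)      # tensor([inf], dtype=torch.float16)

# once inf appears, ordinary ops propagate it as NaN
print(x - x)  # tensor([nan], dtype=torch.float16)
print(x / x)  # tensor([nan], dtype=torch.float16)
```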

zx96-001 commented 1 year ago

Where can I find NAIbasil.safetensors? I wanted to check it out since I saw it mentioned in the thread. Sorry, this comment is not really related to your problem ^^