kohya-ss / sd-scripts

LORA caption training: extremely long pauses between epochs #32

Open sashasubbbb opened 1 year ago

sashasubbbb commented 1 year ago

For some reason there is a large delay when epochs change, making training much slower. What could cause this?

My settings:

accelerate launch --num_cpu_threads_per_process 10 train_network.py --pretrained_model_name_or_path=B:\AIimages\stable-diffusion-webui\models\Stable-diffusion\model.ckpt --train_data_dir=B:\AIimages\training\data --output_dir=B:\train\out\ --in_json=B:\AIimages\training\data\meta_lat.json --resolution=512,512 --prior_loss_weight=1.0 --train_batch_size=4 --learning_rate=1e-3 --max_train_steps=15000 --use_8bit_adam --xformers --gradient_checkpointing --mixed_precision=fp16 --save_every_n_epochs=10 --network_module=networks.lora --shuffle_caption --unet_lr=3e-4 --text_encoder_lr=3e-5 --lr_scheduler=constant --save_model_as=safetensors --seed=115

kohya-ss commented 1 year ago

I think it might be an accelerate issue. Do you see the same issue with fine_tune.py?

I think one of the current workarounds is to use --dataset_repeats to make each epoch larger. Please try this for now.
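(For illustration, a minimal sketch of this suggestion, assuming the flag is simply appended to the original launch command; the repeat count 10 is an arbitrary example value and ... stands for the rest of the original arguments.)

accelerate launch --num_cpu_threads_per_process 10 train_network.py ... --dataset_repeats=10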

sashasubbbb commented 1 year ago

Yeah, this solution worked OK, but only without --cache_latents on. When I use caching, it seems to ignore the dataset_repeats flag. Is there any other way to set multiple repeats of the dataset within a single epoch? Right now, because of these pauses between epochs, training time increases by up to 2x.

Edit: figured it out. You can rename your dataset concept folder to #_concept and set epochs to 1; training will then be repeated # times without switching epochs.
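(For illustration, a sketch of the folder layout this workaround describes, assuming the repeat-prefix naming on the concept subfolder inside train_data_dir; the repeat count 10 and the name concept are placeholders.)

B:\AIimages\training\data\10_concept\    (images for the concept, repeated 10 times within one epoch)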

kohya-ss commented 1 year ago

The dataset_repeats option only works for the DreamBooth method (without a metadata .json). If you use the option with a metadata .json, there might be a bug in the handling of cache_latents or dataset_repeats. I'm working on a refactoring, and the bug will be fixed as part of it.

If repeating via the # in the folder name works, that's good!

kohya-ss commented 1 year ago

I've added the --max_data_loader_n_workers option in #72. A smaller number of workers might reduce the pausing between epochs (the default is 8).
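(For illustration, a sketch of how this option might be applied to the original command; the worker count 2 is an arbitrary small value and ... stands for the rest of the original arguments.)

accelerate launch --num_cpu_threads_per_process 10 train_network.py ... --max_data_loader_n_workers=2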