kohya-ss / sd-scripts


How to resume interrupted training for Flux? #1520

Open murtaza-nasir opened 3 months ago

murtaza-nasir commented 3 months ago

I'm using the following command for training:

accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train_network.py --pretrained_model_name_or_path ~/work/media/flux_lora/models/flux1-dev.safetensors --clip_l ~/work/media/flux_lora/models/clip_l.safetensors --t5xxl ~/work/media/flux_lora/models/t5xxl_fp16.safetensors --ae ~/work/media/flux_lora/models/ae.safetensors --cache_latents_to_disk --save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 --network_module networks.lora_flux --network_dim 4 --optimizer_type adamw8bit --learning_rate 1e-4 --network_train_unet_only --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk --fp8_base --highvram --max_train_epochs 64 --save_every_n_epochs 4 --dataset_config dataset.toml --output_dir ~/work/media/flux_lora/training-output --output_name flux-lora-output --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 --loss_type l2

If training is interrupted, running this command again starts from scratch. How can I modify the command or process to resume training from the last checkpoint instead of starting over?

HelloWarcraft commented 3 months ago

Add one argument: --network_weights <your_latest_lora_path>
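
For example, a minimal sketch reusing the command from the original post (the checkpoint filename at the end is hypothetical; use whichever .safetensors file --save_every_n_epochs last wrote into --output_dir):

```bash
# Same training command as before, with --network_weights appended so the LoRA
# is initialized from the last saved checkpoint instead of from scratch.
# The checkpoint filename on the last line is hypothetical.
accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train_network.py \
  --pretrained_model_name_or_path ~/work/media/flux_lora/models/flux1-dev.safetensors \
  --clip_l ~/work/media/flux_lora/models/clip_l.safetensors \
  --t5xxl ~/work/media/flux_lora/models/t5xxl_fp16.safetensors \
  --ae ~/work/media/flux_lora/models/ae.safetensors \
  --cache_latents_to_disk --save_model_as safetensors --sdpa \
  --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 \
  --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 \
  --network_module networks.lora_flux --network_dim 4 \
  --optimizer_type adamw8bit --learning_rate 1e-4 --network_train_unet_only \
  --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk \
  --fp8_base --highvram --max_train_epochs 64 --save_every_n_epochs 4 \
  --dataset_config dataset.toml \
  --output_dir ~/work/media/flux_lora/training-output --output_name flux-lora-output \
  --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 --loss_type l2 \
  --network_weights ~/work/media/flux_lora/training-output/flux-lora-output-000060.safetensors
```

Note that this only restores the LoRA weights; the optimizer state and step/epoch counters are not carried over, so the learning-rate schedule starts again from the beginning.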

BootsofLagrangian commented 3 months ago

Or pass --save_state at the beginning of training from scratch,

and use this argument for resuming: --resume="STATE_IN_OUTPUT_DIR"

These options save and load the optimizer state along with the weights.
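
A minimal sketch of that two-step workflow, assuming the same arguments as in the original post (the state directory name below is hypothetical; the exact name written by --save_state depends on output_name and on whether it was an epoch save or the final save):

```bash
# Collect the full argument list from the original command into an array
# (only a few entries shown here; include every argument you already use).
args=(
  --pretrained_model_name_or_path ~/work/media/flux_lora/models/flux1-dev.safetensors
  --clip_l ~/work/media/flux_lora/models/clip_l.safetensors
  --dataset_config dataset.toml
  --output_dir ~/work/media/flux_lora/training-output
  --output_name flux-lora-output
  # ... remaining arguments from the original command ...
)

# 1) Train from scratch with --save_state so each save also writes a
#    "...-state" directory containing the optimizer/scheduler state.
accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 \
  flux_train_network.py "${args[@]}" --save_state

# 2) After an interruption, rerun the same command with --resume pointing at
#    the newest state directory in --output_dir (directory name is hypothetical).
accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 \
  flux_train_network.py "${args[@]}" --save_state \
  --resume ~/work/media/flux_lora/training-output/flux-lora-output-000060-state
```

Unlike --network_weights alone, --resume also restores the optimizer and scheduler state saved by --save_state, so training should continue roughly where it left off.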

murtaza-nasir commented 3 months ago

Thank you! Will try these.

murtaza-nasir commented 3 months ago

I get the following error when I use --save_state and --resume. What could be causing this?

https://gist.github.com/murtaza-nasir/beb038e7e337a97ae0335bcdfd2d6307

I am using this command:

accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train_network.py --pretrained_model_name_or_path ~/work/media/flux_lora/models/flux1-dev.safetensors --clip_l ~/work/media/flux_lora/models/clip_l.safetensors --t5xxl ~/work/media/flux_lora/models/t5xxl_fp16.safetensors --ae ~/work/media/flux_lora/models/ae.safetensors --cache_latents_to_disk --save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --full_bf16 --save_precision bf16 --network_module networks.lora_flux --network_dim 8 --optimizer_type adamw8bit --learning_rate 1e-4 --network_train_unet_only --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk --fp8_base --highvram --max_train_epochs 300 --save_every_n_epochs 4 --t5xxl_max_token_length 512 --dataset_config dataset_resized_fulltext.toml --output_dir ~/work/media/flux_lora/training-output --output_name mb_resized_detailed --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 --loss_type l2 --save_state --resume="/home/murtaza/work/media/flux_lora/training-output/mb_resized_detailed-state"
setothegreat commented 3 months ago

Not sure if it's an error, but I'm having an issue where resuming training seems to load everything successfully, yet the training steps are reset back to 0 instead of picking up from the last epoch it was on.

(For example: if I stop training after 2/10 epochs, i.e. at the second saved state, and then resume, it loads the state folder successfully but starts back at 0/10 epochs.)

I would assume this completely messes up the schedulers. Is this intentional, or is there some other argument I need to include alongside --resume?