murtaza-nasir opened this issue 3 months ago
Add one argument: --network_weights <your_latest_lora_path>

Or pass --save_state at the beginning of training from scratch, and then use --resume="STATE_IN_OUTPUT_DIR" when restarting.

These options save and reload the optimizer state along with the weights.
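As I read the two options, a sketch of each (paths are placeholders and the other training arguments are elided with "..."):

```bash
# Option 1: warm-start from the most recent LoRA file only.
# Optimizer state is NOT restored; training restarts from those weights.
accelerate launch flux_train_network.py ... \
  --network_weights /path/to/your_latest_lora.safetensors

# Option 2: full resume. Pass --save_state from the very first run so that
# state directories are written into --output_dir, then restart with --resume.
accelerate launch flux_train_network.py ... --save_state
accelerate launch flux_train_network.py ... --save_state \
  --resume /path/to/output_dir/STATE_IN_OUTPUT_DIR
```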
Thank you! Will try these.
I get the following error when I use --save_state and --resume. What could be causing this?
https://gist.github.com/murtaza-nasir/beb038e7e337a97ae0335bcdfd2d6307
I am using this command:

accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train_network.py \
  --pretrained_model_name_or_path ~/work/media/flux_lora/models/flux1-dev.safetensors \
  --clip_l ~/work/media/flux_lora/models/clip_l.safetensors \
  --t5xxl ~/work/media/flux_lora/models/t5xxl_fp16.safetensors \
  --ae ~/work/media/flux_lora/models/ae.safetensors \
  --cache_latents_to_disk --save_model_as safetensors --sdpa \
  --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 \
  --gradient_checkpointing --mixed_precision bf16 --full_bf16 --save_precision bf16 \
  --network_module networks.lora_flux --network_dim 8 \
  --optimizer_type adamw8bit --learning_rate 1e-4 --network_train_unet_only \
  --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk \
  --fp8_base --highvram --max_train_epochs 300 --save_every_n_epochs 4 \
  --t5xxl_max_token_length 512 --dataset_config dataset_resized_fulltext.toml \
  --output_dir ~/work/media/flux_lora/training-output --output_name mb_resized_detailed \
  --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 \
  --loss_type l2 --save_state \
  --resume="/home/murtaza/work/media/flux_lora/training-output/mb_resized_detailed-state"
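One thing worth checking (a guess, since the gist isn't reproduced here): --resume has to point at a state directory that a previous --save_state run actually wrote, so passing it on the very first run will fail because nothing exists at that path yet. The naming pattern below is how sd-scripts typically names its state folders; verify against your own output directory:

```bash
# List the state directories that --save_state has written so far.
# Per-epoch states are typically named <output_name>-<epoch>-state,
# plus <output_name>-state when training finishes.
ls -d ~/work/media/flux_lora/training-output/*-state
```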
Not sure if it's an error, but I'm running into an issue where resuming training seems to load everything successfully, yet the step counter resets to 0 instead of picking up from the last epoch it was on.

(For example: if I stop training after 2/10 epochs, i.e., on the second saved state, and then resume, the state folder loads successfully but training starts back at 0/10 epochs.)

I would assume this would completely throw off the schedulers. Is this intentional, or is there some other argument I need to include alongside --resume?
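For what it's worth, the scheduler state should be inside the saved state directory: --save_state/--resume go through Accelerate's save_state/load_state, which checkpoint the optimizer, LR scheduler, and RNG state alongside the model. A quick way to confirm (exact file names depend on the Accelerate version, so treat them as illustrative):

```bash
# Inspect what --save_state wrote. The scheduler's step counter lives in
# scheduler.bin, so the LR schedule can continue correctly even if the
# console progress bar restarts at 0/10 epochs.
ls <output_dir>/<output_name>-<epoch>-state/
# e.g.: model.safetensors  optimizer.bin  scheduler.bin  random_states_0.pkl
```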
I'm using the following command for training:
If training is interrupted, running this command again starts from scratch. How can I modify the command or process to resume training from the last checkpoint instead of starting over?