Thanks for the issue @philippe-ml6 :-)
I think we fixed this just recently, see: https://github.com/huggingface/diffusers/pull/2079
Could you try with current "main" and see if it works now? Thanks
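In case it helps, installing from "main" can be done like this (a minimal sketch, assuming a plain pip environment):

```bash
# Install diffusers from the current main branch instead of the PyPI release
pip install git+https://github.com/huggingface/diffusers
```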
cc @pcuenca
Thanks for the reply @patrickvonplaten :) I tried again with the latest versions of the script and diffusers, and the issue still persists.
Hey @philippe-ml6,
I just tried to run text-to-image with the current "main" version of diffusers, and `--resume_from_checkpoint` works just fine for me:

1. First, launch a training run that saves checkpoints:

```bash
export MODEL_NAME="OFA-Sys/small-stable-diffusion-v0"
export INSTANCE_DIR="./pokemon"
export dataset_name="lambdalabs/pokemon-blip-captions"

accelerate launch train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=150 \
  --learning_rate=1e-05 \
  --checkpointing_steps=10 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model"
```
2. Now train for a bit, stop the run, and then launch again with resume from checkpoint:
```bash
export MODEL_NAME="OFA-Sys/small-stable-diffusion-v0"
export INSTANCE_DIR="./pokemon"
export dataset_name="lambdalabs/pokemon-blip-captions"

accelerate launch train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=150 \
  --learning_rate=1e-05 \
  --checkpointing_steps=10 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model" \
  --resume_from_checkpoint="latest"
```
Could you try this and see if it works?
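You can also check what was actually written to a checkpoint directory; with a multi-GPU run there should be one random state file per process (directory name assumed from `--checkpointing_steps=10`):

```bash
# List the saved checkpoint contents; a 2-process run is expected to contain
# random_states_0.pkl and random_states_1.pkl among the saved state files.
ls sd-pokemon-model/checkpoint-10/
```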
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Describe the bug
Currently, there seems to be an issue when resuming training from a checkpoint using the `--resume_from_checkpoint` argument:

```
FileNotFoundError: [Errno 2] No such file or directory: 'output_model/checkpoint-10/random_states_1.pkl'
```
I am using the same multi-GPU configuration for the initial training and for resuming (currently 2 A100s). The `random_states_0.pkl` file seems to be saved properly in the checkpoint, but `random_states_1.pkl` is not there.
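For context, Accelerate's `save_state` writes one RNG state file per process, so a 2-GPU run should produce both `random_states_0.pkl` and `random_states_1.pkl`. A minimal sketch of the save/resume flow (paths taken from the reproduction command below; this is not the actual training script):

```python
# Sketch of Accelerate's checkpointing flow under a 2-process launch.
from accelerate import Accelerator

accelerator = Accelerator()

# ... accelerator.prepare(...) and some training steps ...

# Each process saves its own RNG state: rank 0 writes random_states_0.pkl,
# rank 1 writes random_states_1.pkl, and so on.
accelerator.save_state("output_model/checkpoint-10")

# On resume, each rank expects to find its own random_states_{rank}.pkl;
# if rank 1's file was never written, this raises the FileNotFoundError above.
accelerator.load_state("output_model/checkpoint-10")
```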
Reproduction
Training launch:

```bash
accelerate launch --config_file accelerate_config.yaml --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=models/stable-diffusion-v1-5-fp32 \
  --dataset_name=clean-cut-dummy \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=4 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=25 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="" \
  --seed 1024 \
  --resume_from_checkpoint output_model/checkpoint-10
```

accelerate_config.yaml:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false
```

Logs
System Info
`diffusers` version: 0.11.0