huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Issue when resuming training from checkpoint with multi-gpu pipeline `train_text_to_image.py` #2091

Closed: philippe-ml6 closed this issue 1 year ago

philippe-ml6 commented 1 year ago

Describe the bug

Currently, there seems to be an issue when resuming training from a checkpoint using the `--resume_from_checkpoint` argument:

```
FileNotFoundError: [Errno 2] No such file or directory: 'output_model/checkpoint-10/random_states_1.pkl'
```

I am using the same multi-gpu configuration for the initial training and for resuming (currently two A100s). The `random_states_0.pkl` file seems to be saved properly in the checkpoint, but `random_states_1.pkl` is not there.
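
For context, as far as I understand Accelerate's checkpoint layout, each process writes its own `random_states_{rank}.pkl` on save and reads it back on load, so a two-process resume needs both files to exist. A minimal sketch of that round trip (the path and training loop are placeholders, not the exact script code):

```python
# Minimal sketch of the save/resume round trip; the checkpoint path is a placeholder.
from accelerate import Accelerator

accelerator = Accelerator()
# ... prepare model/optimizer/dataloader with accelerator.prepare(...) and train ...

# Each rank writes its own RNG file: random_states_0.pkl, random_states_1.pkl, ...
accelerator.save_state("output_model/checkpoint-10")

# On resume, each rank loads random_states_{rank}.pkl, so every file must exist.
accelerator.load_state("output_model/checkpoint-10")
```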

Reproduction

training launch:

```bash
accelerate launch --config_file accelerate_config.yaml --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=models/stable-diffusion-v1-5-fp32 \
  --dataset_name=clean-cut-dummy \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=4 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=25 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="" \
  --seed 1024 \
  --resume_from_checkpoint output_model/checkpoint-10
```

accelerate_config.yaml:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false
```
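
A quick way to check which RNG files actually made it into the checkpoint (diagnostic snippet, not part of the training script):

```python
# Diagnostic only: list the per-process RNG files in the checkpoint directory.
import os

ckpt = "output_model/checkpoint-10"
print(sorted(f for f in os.listdir(ckpt) if f.startswith("random_states")))
# With num_processes: 2 I would expect ['random_states_0.pkl', 'random_states_1.pkl'].
```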

Logs

```
01/24/2023 15:31:30 - INFO - __main__ - ***** Running training *****
01/24/2023 15:31:30 - INFO - __main__ -   Num examples = 15
01/24/2023 15:31:30 - INFO - __main__ -   Num Epochs = 25
01/24/2023 15:31:30 - INFO - __main__ -   Instantaneous batch size per device = 4
01/24/2023 15:31:30 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 32
01/24/2023 15:31:30 - INFO - __main__ -   Gradient Accumulation steps = 4
01/24/2023 15:31:30 - INFO - __main__ -   Total optimization steps = 25
Resuming from checkpoint checkpoint-10
01/24/2023 15:31:30 - INFO - accelerate.accelerator - Loading states from output_model/checkpoint-10
01/24/2023 15:31:32 - INFO - accelerate.checkpointing - All model weights loaded successfully
01/24/2023 15:31:38 - INFO - accelerate.checkpointing - All optimizer states loaded successfully
01/24/2023 15:31:38 - INFO - accelerate.checkpointing - All scheduler states loaded successfully
01/24/2023 15:31:38 - INFO - accelerate.checkpointing - GradScaler state loaded successfully
01/24/2023 15:31:38 - INFO - accelerate.checkpointing - All random states loaded successfully
01/24/2023 15:31:38 - INFO - accelerate.accelerator - Loading in 1 custom states
01/24/2023 15:31:38 - INFO - accelerate.checkpointing - Loading the state of EMAModel from output_model/checkpoint-10/custom_checkpoint_0.pkl
Traceback (most recent call last):
  File "text_to_image_latest.py", line 792, in <module>
    main()
  File "text_to_image_latest.py", line 685, in main
    accelerator.load_state(os.path.join(args.output_dir, path))
  File "/home/jupyter/venv-latest/lib/python3.7/site-packages/accelerate/accelerator.py", line 1400, in load_state
    load_accelerator_state(input_dir, models, optimizers, schedulers, self.state.process_index, self.scaler)
  File "/home/jupyter/venv-latest/lib/python3.7/site-packages/accelerate/checkpointing.py", line 158, in load_accelerator_state
    states = torch.load(os.path.join(input_dir, f"{RNG_STATE_NAME}_{process_index}.pkl"))
  File "/home/jupyter/venv-latest/lib/python3.7/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/jupyter/venv-latest/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/jupyter/venv-latest/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'output_model/checkpoint-10/random_states_1.pkl'
Steps:   0%|                                                                                                                           | 0/15 [00:00<?, ?it/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 9080 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 9081) of binary: /home/jupyter/venv-latest/bin/python3
```
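
The failing call is the per-rank RNG load. Paraphrasing the code path from the traceback (not a verbatim copy of accelerate's source):

```python
# Paraphrase of the failing code path in accelerate.checkpointing, based on the traceback above.
import os
import torch

RNG_STATE_NAME = "random_states"

def load_rng_state(input_dir: str, process_index: int):
    # Rank 1 looks for random_states_1.pkl; if only rank 0's file was written
    # at save time, this raises the FileNotFoundError shown above.
    return torch.load(os.path.join(input_dir, f"{RNG_STATE_NAME}_{process_index}.pkl"))
```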

System Info

patrickvonplaten commented 1 year ago

Thanks for the issue @philippe-ml6 :-)

I think we fixed this just recently, see: https://github.com/huggingface/diffusers/pull/2079

Could you try with current "main" and see if it works now? Thanks

cc @pcuenca

philippe-ml6 commented 1 year ago

Thanks for the reply @patrickvonplaten :) I tried again with the latest version of the script and diffusers, and the issue still persists.

patrickvonplaten commented 1 year ago

Hey @philippe-ml6,

I just tried to run text-to-image with the current "main" version of diffusers and it works just fine with `--resume_from_checkpoint`:

1. First run (without `resume_from_checkpoint`, since we haven't done a run yet):

```bash
export MODEL_NAME="OFA-Sys/small-stable-diffusion-v0"
export INSTANCE_DIR="./pokemon"
export dataset_name="lambdalabs/pokemon-blip-captions"

accelerate launch train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=150 \
  --learning_rate=1e-05 \
  --checkpointing_steps=10 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model"
```

2. Now you can train for a bit, stop it, and then resume from the checkpoint:

```bash
export MODEL_NAME="OFA-Sys/small-stable-diffusion-v0"
export INSTANCE_DIR="./pokemon"
export dataset_name="lambdalabs/pokemon-blip-captions"

accelerate launch train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=150 \
  --learning_rate=1e-05 \
  --checkpointing_steps=10 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model" \
  --resume_from_checkpoint="latest"
```
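
For reference, `--resume_from_checkpoint="latest"` is resolved by picking the `checkpoint-*` directory with the highest step number. Roughly (a paraphrase of the training script's logic, not its exact source):

```python
# Rough sketch of how "latest" is resolved to a checkpoint directory.
import os

def resolve_latest_checkpoint(output_dir: str):
    dirs = [d for d in os.listdir(output_dir) if d.startswith("checkpoint")]
    dirs = sorted(dirs, key=lambda d: int(d.split("-")[1]))
    return dirs[-1] if dirs else None  # e.g. "checkpoint-150"
```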

patrickvonplaten commented 1 year ago

Could you try this and see if it works?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.