huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

[Training] Resume checkpoint global step inconsistent/confusion across scripts #8296

Open · vinm007 opened this issue 5 months ago

vinm007 commented 5 months ago

Describe the bug

Hi, I have been working with the training scripts for multiple models (T2I, IP2P) and found that the logic for calculating the step and epoch when resuming training differs across scripts. In the train_text_to_image.py script (link):

accelerator.print(f"Resuming from checkpoint {path}")
accelerator.load_state(os.path.join(args.output_dir, path))
global_step = int(path.split("-")[1])
initial_global_step = global_step
first_epoch = global_step // num_update_steps_per_epoch
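
Just to make my reading concrete, here is the arithmetic with made-up numbers (checkpoint-1700 and 500 optimizer updates per epoch are purely illustrative):

# Made-up values, only to spell out the arithmetic above.
path = "checkpoint-1700"                    # folder name written by accelerator.save_state
num_update_steps_per_epoch = 500            # optimizer updates per epoch (after accumulation)

global_step = int(path.split("-")[1])       # 1700 optimizer updates completed so far
initial_global_step = global_step           # progress bar resumes from 1700
first_epoch = global_step // num_update_steps_per_epoch   # 1700 // 500 = 3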

In the train_instruct_pix2pix.py script (link):

accelerator.print(f"Resuming from checkpoint {path}")
accelerator.load_state(os.path.join(args.output_dir, path))
global_step = int(path.split("-")[1])

resume_global_step = global_step * args.gradient_accumulation_steps
first_epoch = global_step // num_update_steps_per_epoch
resume_step = resume_global_step % (num_update_steps_per_epoch * args.gradient_accumulation_steps)
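
With the same made-up numbers plus gradient_accumulation_steps = 4, my tentative reading of this variant is that it converts optimizer updates back into raw dataloader batches:

# Same made-up values as above, plus gradient accumulation.
global_step = 1700                          # optimizer updates completed so far
gradient_accumulation_steps = 4
num_update_steps_per_epoch = 500

resume_global_step = 1700 * 4               # = 6800 dataloader batches consumed so far
first_epoch = 1700 // 500                   # = 3, same as the other script
resume_step = 6800 % (500 * 4)              # = 800 batches to skip inside epoch 3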

In a similar issue, some changes were made to fix the progress-bar inconsistency, but I am a bit confused about the following:

  1. The multiplication by args.gradient_accumulation_steps in the train_instruct_pix2pix.py script
  2. In general, what does global_step indicate and how is it updated? In both scripts I can see the following code, but I couldn't understand it from the accelerate documentation:
    if accelerator.sync_gradients:
        if args.use_ema:
            ema_unet.step(unet.parameters())
        progress_bar.update(1)
        global_step += 1
        accelerator.log({"train_loss": train_loss}, step=global_step)
        train_loss = 0.0

    If we are using multiple GPUs with gradient accumulation, at what event is global_step updated? Is it updated independently by each GPU (since the code is not wrapped in accelerator.is_main_process), and how does accumulation affect the tracking here? (See the sketch just below.)
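
    For reference, this is my current (possibly wrong) reading of the loop structure shared by these scripts; compute_loss is a placeholder and the comments are my interpretation, not something I found in the docs:

    for epoch in range(first_epoch, args.num_train_epochs):
        for step, batch in enumerate(train_dataloader):
            with accelerator.accumulate(unet):
                loss = compute_loss(batch)    # placeholder for the model-specific loss code
                accelerator.backward(loss)
                optimizer.step()              # as I understand it, accelerate skips the real update
                lr_scheduler.step()           # except every gradient_accumulation_steps batches
                optimizer.zero_grad()

            # My assumption: sync_gradients is True only on the batch where the optimizer
            # actually stepped, and it has the same value on every process, so global_step
            # would count optimizer updates and stay identical across GPUs.
            if accelerator.sync_gradients:
                progress_bar.update(1)
                global_step += 1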

Reproduction

-

Logs

No response

System Info

-

Who can help?

@sayakpaul

sayakpaul commented 5 months ago

Sorry that you're facing confusion.

The multiplication of args.gradient_accumulation_steps in train_instruct_pix2pix.py script

Why should it not be the case? It's based on steps and without the GA steps, the calculation would be improper, no?

Cc'ing @muellerzr for further clarification from the accelerate side.

vinm007 commented 5 months ago

Sorry that you're facing confusion.

The multiplication of args.gradient_accumulation_steps in train_instruct_pix2pix.py script

Why should it not be the case? It's based on steps and without the GA steps, the calculation would be improper, no?

Cc'ing @muellerzr for further clarification from the accelerate side.

I am not sure about the correct calculation, but I do find it different between these two scripts. Is one of them outdated or wrong? To work out the calculation, I tried to understand how global_step is updated but couldn't. In general, it is incremented by 1 when accelerator.sync_gradients is true. The following code is used to update global_step:

if accelerator.sync_gradients:
    if args.use_ema:
        ema_unet.step(unet.parameters())
    progress_bar.update(1)
    global_step += 1
    accelerator.log({"train_loss": train_loss}, step=global_step)
    train_loss = 0.0

What does this code imply? Is this counter updated by each GPU (in the multi-process scenario) or not? Does the sync_gradients flag take care of gradient accumulation or not? Only once I understand that can I work out the calculation.
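
If it helps, this is the kind of minimal check I would run to see the behaviour myself; it is an untested sketch, and the file name, toy data, and accumulation value of 4 are all made up:

# check_sync.py -- untested sketch; run with e.g. `accelerate launch --num_processes 2 check_sync.py`
import torch
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
dataloader = torch.utils.data.DataLoader(torch.randn(64, 8), batch_size=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

global_step = 0
for batch in dataloader:
    with accelerator.accumulate(model):
        loss = model(batch).mean()
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
    if accelerator.sync_gradients:      # expect True only every 4th batch
        global_step += 1

# If my reading is right, every process should print the same global_step.
print(f"process {accelerator.process_index}: global_step = {global_step}")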

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.