Open vinm007 opened 5 months ago
Sorry that you're facing confusion.

> The multiplication of args.gradient_accumulation_steps in train_instruct_pix2pix.py script

Why should it not be the case? It's based on steps, and without the GA steps the calculation would be improper, no?

Ccing @muellerzr for further clarification in light of `accelerate`.
I am not sure about the calculation, but I do find it different in these two scripts. Is one of them outdated or wrong?
To deduce this calculation, I tried to understand how `global_step` is updated but couldn't. In general, it is incremented by 1 when `accelerator.sync_gradients` is true. The following code is used to update `global_step`:
```python
if accelerator.sync_gradients:
    if args.use_ema:
        ema_unet.step(unet.parameters())
    progress_bar.update(1)
    global_step += 1
    accelerator.log({"train_loss": train_loss}, step=global_step)
    train_loss = 0.0
```
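To illustrate the gating behavior being asked about, here is a minimal sketch (not `accelerate`'s actual internals; the function and variable names are made up for illustration): under gradient accumulation, `sync_gradients` is true only on the micro-batch that triggers an optimizer update, so `global_step` counts optimizer updates rather than dataloader iterations.

```python
# Hypothetical sketch of how sync_gradients gates global_step under
# gradient accumulation. Names are illustrative, not accelerate's code.
def count_global_steps(num_micro_batches, gradient_accumulation_steps):
    """global_step increments once per accumulation cycle, i.e. once
    per optimizer update, not once per micro-batch."""
    global_step = 0
    for micro_batch in range(1, num_micro_batches + 1):
        # accelerate flips sync_gradients to True only when the
        # accumulated gradients are about to be applied
        sync_gradients = micro_batch % gradient_accumulation_steps == 0
        if sync_gradients:
            global_step += 1
    return global_step

print(count_global_steps(100, 4))  # 25 optimizer updates
print(count_global_steps(100, 1))  # no accumulation: 100 updates
```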
What does this code imply?
Is this counter updated by each GPU (in a multi-process scenario) or not? Does the `sync_gradients` flag take care of gradient accumulation or not? Only based on that can I deduce the calculation.
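On the multi-process question, here is a hedged sketch (an illustrative simulation, not real distributed code): each process executes the same training loop, so each rank holds its own local `global_step`, and the copies stay equal because `sync_gradients` becomes true at the same micro-batch index on every rank, which is why no `is_main_process` guard is needed around the counter itself.

```python
# Illustrative simulation (not real DDP/accelerate code): every rank runs
# the same loop, so each rank's local global_step advances in lockstep.
def per_rank_global_steps(num_ranks, num_micro_batches, ga_steps):
    counters = []
    for rank in range(num_ranks):
        global_step = 0
        for mb in range(1, num_micro_batches + 1):
            if mb % ga_steps == 0:  # sync point is identical on all ranks
                global_step += 1
        counters.append(global_step)
    return counters

print(per_rank_global_steps(4, 12, 3))  # [4, 4, 4, 4]: same on every rank
```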
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Describe the bug

Hi, I have been working on training scripts for multiple models (T2I, IP2P) and found that the logic to calculate `step` and `epoch` while resuming training differs across scripts: in the `train_text_to_image.py` script (link) and in the `train_instruct_pix2pix.py` script (link). In a similar issue, some changes were made for the progress-bar inconsistency, but I am a bit confused about the following things:

- the multiplication by `args.gradient_accumulation_steps` in the `train_instruct_pix2pix.py` script (`accelerate` documentation)
- if we are using multiple GPUs with gradient accumulation, at what event is `global_step` updated? Is it being updated independently by each GPU (since the code is not wrapped with `accelerator.is_main_process`)? Also, how does accumulation affect the tracking here?

Reproduction
-
Logs
No response
System Info
-
Who can help?
@sayakpaul
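For reference, the resume arithmetic being asked about can be sketched roughly as follows (a hedged reconstruction with illustrative names, not the scripts' exact code): a checkpoint stores `global_step` in optimizer-update units, so converting it back to a dataloader position requires multiplying by `gradient_accumulation_steps`, which is the multiplication the issue questions.

```python
# Hedged sketch of the resume calculation (illustrative names, not the
# scripts' exact code): convert a checkpoint's global_step (counted in
# optimizer updates) into (first_epoch, micro-batches to skip in epoch).
def resume_position(global_step, num_update_steps_per_epoch,
                    gradient_accumulation_steps):
    first_epoch = global_step // num_update_steps_per_epoch
    # total micro-batches consumed so far, hence the GA multiplication
    consumed_micro_batches = global_step * gradient_accumulation_steps
    micro_batches_per_epoch = (num_update_steps_per_epoch
                               * gradient_accumulation_steps)
    resume_step = consumed_micro_batches % micro_batches_per_epoch
    return first_epoch, resume_step

# 250 updates at 100 updates/epoch with GA=4:
# resume in epoch 2, skipping 200 micro-batches
print(resume_position(250, 100, 4))  # (2, 200)
```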