hj13-mtlab opened 1 day ago
Pinging @sayakpaul about the training scripts. I don't think `args.lr_warmup_steps * args.gradient_accumulation_steps` is correct: with gradient accumulation you already perform fewer gradient updates, so stretching the time it takes to reach the true/peak LR does not make sense. I think `lr_warmup_steps * num_processes` is correct, so that each rank gets a roughly equal number of learning steps going from low to true/peak LR.
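To make the accumulation argument concrete, here is a minimal sketch (not the diffusers code; the numbers and the linear-warmup helper are illustrative assumptions). It shows that multiplying warmup by the accumulation factor can stretch warmup past the number of optimizer updates that actually happen:

```python
# Illustration (not the diffusers code): the LR scheduler is stepped once per
# optimizer update, and gradient accumulation reduces the number of updates.

def warmup_lr(step, warmup_steps, peak_lr=1e-4):
    """Linear warmup: LR ramps from 0 to peak_lr over warmup_steps updates."""
    return peak_lr * min(step / warmup_steps, 1.0)

total_batches = 1000                          # micro-batches seen per process
grad_accum = 4                                # gradient accumulation steps
optimizer_steps = total_batches // grad_accum # only 250 real updates

warmup = 100                                  # --lr_warmup_steps

# Scaling warmup by grad_accum (100 * 4 = 400) exceeds the 250 real updates,
# so this run never reaches peak LR:
print(warmup_lr(optimizer_steps, warmup * grad_accum))  # 6.25e-05, below peak
# Without the scaling, warmup finishes well inside the run:
print(warmup_lr(optimizer_steps, warmup))               # 0.0001, at peak
```

Whether 400 scheduler steps is "wrong" depends on how often `scheduler.step()` is actually called per update in the training loop, which is exactly what the question below is about.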
In the example code from `train_text_to_image_sdxl.py`, the warmup is scaled by `args.gradient_accumulation_steps`, but in `train_text_to_image.py` it is scaled by `num_processes`. Why is there such a difference between these two cases?
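One plausible rationale for the `num_processes` scaling (an assumption on my part, not verified against accelerate internals) is that when every process calls `scheduler.step()` on each update, the scheduler advances `num_processes` ticks per global optimizer step, so the warmup count must be inflated to compensate. A toy model of that behavior:

```python
# Assumption being illustrated: the scheduler receives `ticks_per_update`
# .step() calls for every real optimizer update (one per process).

def lr_after_updates(updates, warmup_steps, ticks_per_update, peak_lr=1e-4):
    """Linear warmup as seen after `updates` real optimizer updates."""
    ticks = updates * ticks_per_update
    return peak_lr * min(ticks / warmup_steps, 1.0)

num_processes = 4
warmup = 100  # intended warmup length, in real optimizer updates

# Unscaled: the 100 scheduler ticks are consumed after only 25 real updates,
# so warmup ends 4x too early.
print(lr_after_updates(25, warmup, num_processes))                   # 0.0001
# Scaled by num_processes: peak LR is reached after 100 real updates,
# which is the intended behavior.
print(lr_after_updates(50, warmup * num_processes, num_processes))   # 5e-05
print(lr_after_updates(100, warmup * num_processes, num_processes))  # 0.0001
```

If that reading is right, the two multipliers are compensating for different things (extra scheduler ticks from multiple processes vs. fewer updates from accumulation), and only one of them can be appropriate for a given training loop.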