arlofaria opened this issue 1 year ago (status: Open)
@nateanl Temporarily assigning you. Can you take a look at it?
accumulate_grad_batches will be unusable if manual backward is used, but there seems to be a solution from pytorch_lightning: https://pytorch-lightning.readthedocs.io/en/stable/model/manual_optimization.html#gradient-accumulation
> because the user would need to accordingly scale the number of max updates.

I'm not sure if I understand it. Could you elaborate on the scaling here? @arlofaria
Thanks for the link! I'll try that out...
I might be misunderstanding PyTorch Lightning, but I think with this manual backward approach it might be necessary to set Trainer(max_steps=args.max_updates*N) to achieve the same effect as with Trainer(accumulate_grad_batches=N).
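If that reading is right, the two setups would look roughly like this (a sketch, assuming that under manual optimization max_steps counts training_step calls rather than optimizer updates; args.max_updates and N are just the symbols used in this thread, not actual recipe flags):

```python
from pytorch_lightning import Trainer

# Manual optimization with accumulation over N batches: scale the step limit
# so the total number of weight updates stays the same (assumption).
trainer = Trainer(max_steps=args.max_updates * N)

# Automatic optimization, where Lightning does the accumulation itself:
trainer = Trainer(max_steps=args.max_updates, accumulate_grad_batches=N)
```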
I see, that makes sense. I will work on supporting the equivalent of Trainer(accumulate_grad_batches=N) in the recipe, using the same training logic as https://pytorch-lightning.readthedocs.io/en/stable/model/manual_optimization.html#gradient-accumulation, with max_steps implicitly converted to max_updates * N.
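For reference, the pattern in the linked docs looks roughly like this when wrapped in a module (a minimal sketch; the module name, loss computation, and optimizer are placeholders, not the recipe's actual code):

```python
import torch
import pytorch_lightning as pl

class AccumulatingModule(pl.LightningModule):
    # Hypothetical module; only the accumulation logic is the point here.
    def __init__(self, model, loss_fn, accum_steps):
        super().__init__()
        self.automatic_optimization = False  # manual optimization, as in the recipe
        self.model = model
        self.loss_fn = loss_fn
        self.N = accum_steps

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)  # placeholder optimizer

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        loss = self.loss_fn(self.model(batch))  # placeholder loss computation
        self.manual_backward(loss)
        # accumulate gradients over N batches, then update and reset
        if (batch_idx + 1) % self.N == 0:
            opt.step()
            opt.zero_grad()
```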
Thanks!
In that case, you might also need to be careful to scale the loss by 1/N (as well as by WORLD_SIZE / sum_of_frames as discussed in #2744).
Alternatively, would it be easier to revert the custom training_step() (but keeping the frame-normalized loss adjustment) and use PyTorch Lightning's automatic optimization, considering that the problem with Trainer(precision=16) now seems to be resolved?
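Concretely, the backward call in a sketch like the one above might then become something like this (hedged; world_size, sum_of_frames, and self.N stand for the quantities discussed in this thread and in #2744, not actual recipe variables):

```python
# Scale so that the gradient accumulated over N batches and across all workers
# matches what a single large, frame-normalized batch would produce.
self.manual_backward(loss * world_size / sum_of_frames / self.N)
```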
🚀 The feature
It would be nice if gradient accumulation functionality could be added to the HuBERT recipe.
Motivation, pitch
Using gradient accumulation can simulate a larger cluster / larger effective batch sizes, for the purpose of replicating others' results or maintaining consistency across experiments that run on different numbers of GPUs.
For example, FAIR's HuBERT experiments ran on 32+ GPUs, but I generally have access to fewer devices than that.
Alternatives
In an earlier version of the recipes, before HuBERTPreTrainModule.automatic_optimization was set to False in #2744, it was as simple as passing accumulate_grad_batches=... as an argument to PyTorch Lightning's Trainer. However, that now fails with this message:

So we could perhaps revert to the automatic optimization -- but then we'd lose the training step's customized NaN-handling, loss normalization, gradient clipping, and AMP training.
Alternatively, one can try to tweak the learning rate parameters (max and schedule) for each experiment, but that's rather tricky.
Additional context
I think some gradient accumulation functionality could be added to the custom training_step by tracking the batch_idx and only performing an update on accumulated gradients every N batches. However, it might not have the same semantics as Trainer(accumulate_grad_batches=...), because the user would need to accordingly scale the number of max updates.
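A hedged sketch of how that could fold into the recipe's custom step while keeping its gradient clipping and NaN handling (the clipping threshold, the accumulation factor self.N, and compute_loss are placeholders; this is not the actual recipe code):

```python
import torch

# Inside a LightningModule subclass with self.automatic_optimization = False:
def training_step(self, batch, batch_idx):
    opt = self.optimizers()
    loss = self.compute_loss(batch)       # placeholder for the recipe's loss
    if torch.isnan(loss):                 # keep the custom NaN handling
        return None
    self.manual_backward(loss / self.N)   # 1/N scaling, as discussed above
    if (batch_idx + 1) % self.N == 0:
        # clip the *accumulated* gradients just before the update
        torch.nn.utils.clip_grad_norm_(self.parameters(), max_norm=10.0)
        opt.step()
        opt.zero_grad()
```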