pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

Enable gradient accumulation for HuBERT recipe #2918

Open arlofaria opened 1 year ago

arlofaria commented 1 year ago

πŸš€ The feature

It would be nice if gradient accumulation functionality could be added to the HuBERT recipe.

Motivation, pitch

Gradient accumulation can simulate a larger cluster, i.e. a larger effective batch size, which is useful for replicating others' results or keeping experiments consistent across different numbers of GPUs.

For example, FAIR's HuBERT experiments ran on 32+ GPUs, but I generally have access to fewer devices than that.

Alternatives

In an earlier version of the recipes, before HuBERTPreTrainModule.automatic_optimization was set to False in #2744, it was as simple as passing accumulate_grad_batches=... as an argument to PyTorch Lightning's Trainer. However, that now fails with this message:

MisconfigurationException: Automatic gradient accumulation is not supported for manual optimization.
Remove `Trainer(accumulate_grad_batches=...)` or switch to automatic optimization.

So we could perhaps revert to automatic optimization, but then we'd lose the training step's customized NaN handling, loss normalization, gradient clipping, and AMP training.

Alternatively, one can try to tweak the learning-rate parameters (maximum value and schedule) for each experiment, but that's rather tricky.

Additional context

I think some gradient accumulation functionality could be added to the custom training_step by tracking batch_idx and only stepping the optimizer on the accumulated gradients every N batches. However, it might not have the same semantics as Trainer(accumulate_grad_batches=...), because the user would need to scale the number of max updates accordingly.
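Roughly, I'm imagining something like the following. This is only a minimal sketch under manual optimization, not the actual recipe code: the class name, _compute_loss, the placeholder optimizer, and the clipping value are all made up.

```python
import torch
from pytorch_lightning import LightningModule


class AccumulatingHuBERTModuleSketch(LightningModule):
    """Illustrative only: accumulate gradients over N batches under manual optimization."""

    def __init__(self, model: torch.nn.Module, accumulate_grad_batches: int = 1):
        super().__init__()
        self.automatic_optimization = False  # as in the current recipe
        self.model = model
        self.accumulate_grad_batches = accumulate_grad_batches

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        loss = self._compute_loss(batch)  # stand-in for the recipe's HuBERT loss
        # Scale so that N accumulated backward passes match one large-batch update.
        self.manual_backward(loss / self.accumulate_grad_batches)

        # Only clip/step/zero every N batches; otherwise gradients keep accumulating.
        if (batch_idx + 1) % self.accumulate_grad_batches == 0:
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=10.0)  # illustrative value
            opt.step()
            opt.zero_grad()
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.model.parameters(), lr=1e-4)  # placeholder optimizer

    def _compute_loss(self, batch):
        raise NotImplementedError  # placeholder for the recipe's masked-prediction loss
```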

mthrok commented 1 year ago

@nateanl Temporarily assigning this to you. Can you take a look?

nateanl commented 1 year ago

accumulate_grad_batches is unusable when manual backward is used, but PyTorch Lightning documents a workaround: https://pytorch-lightning.readthedocs.io/en/stable/model/manual_optimization.html#gradient-accumulation

because the user would need to scale the number of max updates accordingly.

I'm not sure I understand. Could you elaborate on the scaling here? @arlofaria

arlofaria commented 1 year ago

Thanks for the link! I'll try that out...

I might be misunderstanding PyTorch Lightning, but I think that with this manual-backward approach it might be necessary to set Trainer(max_steps=args.max_updates * N) to achieve the same effect as Trainer(accumulate_grad_batches=N).
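Concretely, the adjustment I have in mind is something like this (illustrative only; the value of max_updates and the accumulation factor N are made up, and the comments reflect my understanding rather than confirmed Lightning behavior):

```python
from argparse import Namespace
from pytorch_lightning import Trainer

args = Namespace(max_updates=250_000)  # example value only, not the recipe's default
N = 4                                  # hypothetical accumulation factor

# Automatic optimization: Lightning handles the accumulation, so max_steps can stay as-is.
trainer_auto = Trainer(max_steps=args.max_updates, accumulate_grad_batches=N)

# Manual optimization (as in the current recipe): if every training_step call counts
# toward max_steps, the budget needs to be multiplied by N so that the number of
# actual optimizer updates stays the same.
trainer_manual = Trainer(max_steps=args.max_updates * N)
```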

nateanl commented 1 year ago

I see, that makes sense. I will work on supporting an equivalent of Trainer(accumulate_grad_batches=N) in the recipe, using the same training logic as https://pytorch-lightning.readthedocs.io/en/stable/model/manual_optimization.html#gradient-accumulation, with max_steps converted to max_updates * N implicitly.

arlofaria commented 1 year ago

Thanks!

In that case, you might also need to be careful to scale the loss by 1/N (as well as by WORLD_SIZE / sum_of_frames as discussed in #2744).
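Something along these lines, purely as a sketch: the function and variable names are made up, and I'm assuming sum_of_frames means the frame count summed across all workers (hence the all_reduce), which may not match exactly what #2744 settled on.

```python
import torch
import torch.distributed as dist


def scale_loss(raw_loss: torch.Tensor, num_frames: torch.Tensor,
               world_size: int, accumulate_grad_batches: int) -> torch.Tensor:
    """Illustrative only: combine the WORLD_SIZE / sum_of_frames normalization
    discussed above with an extra 1/N factor for gradient accumulation."""
    total_frames = num_frames.detach().clone()
    if dist.is_available() and dist.is_initialized():
        # Sum frame counts across workers so every rank divides by the same total.
        dist.all_reduce(total_frames, op=dist.ReduceOp.SUM)
    loss = raw_loss * world_size / total_frames
    return loss / accumulate_grad_batches
```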

Alternatively, would it be easier to revert the custom training_step() (while keeping the frame-normalized loss adjustment) and use PyTorch Lightning's automatic optimization, now that the problem with Trainer(precision=16) seems to be resolved?