microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

BEiT: Gradient Accumulation during Pre-Training #494

Closed: lychrel closed this issue 2 years ago

lychrel commented 3 years ago

BEiT supports gradient accumulation during fine-tuning, but not during pre-training. I've implemented it for pre-training (in engine_for_pretraining.py) by following the authors' fine-tuning implementation (fairly straightforward, in engine_for_finetuning.py). The only significant difference I'm aware of is not using EMA.
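For reference, the core of my change is roughly the following. This is a simplified sketch rather than the actual engine code: the forward signature, loss computation, loss scaler, and distributed/logging details are placeholders.

```python
import torch

def train_one_epoch(model, data_loader, optimizer, device, accum_steps=1):
    """Sketch of gradient accumulation for pre-training (placeholder names)."""
    model.train()
    optimizer.zero_grad()
    for step, (samples, bool_masked_pos) in enumerate(data_loader):
        samples = samples.to(device, non_blocking=True)
        bool_masked_pos = bool_masked_pos.to(device, non_blocking=True)

        # Placeholder: in the real engine the loss comes from the masked-image-modeling
        # objective; here we just assume the model returns it directly.
        loss = model(samples, bool_masked_pos)

        # Scale the loss so the accumulated gradient matches a single large batch.
        (loss / accum_steps).backward()

        # Only update the weights once every `accum_steps` micro-batches.
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```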

However, I've observed a slight degradation in downstream fine-tuning performance when using the weights trained with gradient accumulation versus those trained without (i.e. with a larger per-device batch size).

I was wondering if there's any obvious reason this would occur, and/or any reason gradient accumulation wasn't included as an option in pre-training by default?

donglixp commented 2 years ago

If it's correctly implemented, the results should be quite similar, since gradient accumulation is a general PyTorch trick. We usually allocate enough GPU resources to pre-training jobs in order to accelerate the experiments, so gradient accumulation was not required in our previous experiments.

lersouza commented 2 years ago

@lychrel, have you found the cause of that? One question I'd like to ask is about the number of update steps.

For instance, if a model is pre-trained with a batch size of 128 for 150 epochs and you can only fit 64 examples in GPU memory, you would use a gradient accumulation of 2. However, given your dataset still has the same size, that would result in only 75 epochs' worth of weight updates ... the other 75 would just be accumulating gradients.
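To put some hypothetical numbers on it (the dataset size here is made up, just for illustration):

```python
# Hypothetical numbers for the scenario above.
dataset_size = 1_000_000
epochs = 150
micro_batch = 64
accum_steps = 2                                        # effective batch size = 64 * 2 = 128

iters_per_epoch = dataset_size // micro_batch          # forward/backward passes per epoch
updates_per_epoch = iters_per_epoch // accum_steps     # optimizer.step() calls per epoch

total_updates = epochs * updates_per_epoch
print(updates_per_epoch, total_updates)                # only half of the iterations update the weights
```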

Does that make sense? Maybe this would result in a "less pretrained model"?