huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Increasing gradient accumulation steps significantly slows down training #10163

Closed: keleog closed this issue 3 years ago

keleog commented 3 years ago

When training with a batch size of 32 (gradient accumulation steps = 1), training speed is approximately 6 it/s. However, when I increase gradient accumulation steps to 4 or 8 (equivalent to batch sizes of 128 and 256), speed drops to 1.03 it/s.

Is this expected behaviour?
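For reference, a minimal sketch of the two configurations being compared. The actual training script isn't shown in this issue, so this assumes the run uses the `Trainer` API and the values are illustrative:

```python
from transformers import TrainingArguments

# Effective batch size 32: 32 per device, no accumulation (reported ~6 it/s).
args_no_accum = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
)

# Effective batch size 128: same per-device batch, 4 accumulation steps
# (reported ~1.03 it/s).
args_accum = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
)
```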

Environment info

Who can help

Information

Model I am using (Bert, XLNet ...): XLMR

The problem arises when using:

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

1.
2.
3.

Expected behavior

keleog commented 3 years ago

@sgugger @LysandreJik

Please help.

sgugger commented 3 years ago

A reported step is a training step (one with an optimizer pass). When you increase gradient accumulation, each step consumes more input batches, so it's normal to see fewer training steps per second.
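A minimal PyTorch-style sketch of this behaviour (illustrative only, not the Trainer's actual internals; the toy model and data are placeholders):

```python
import torch
from torch import nn

# Toy stand-ins for the real model, optimizer, and data.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataloader = [(torch.randn(32, 16), torch.randint(0, 2, (32,))) for _ in range(8)]
loss_fn = nn.CrossEntropyLoss()

accum_steps = 4  # one reported "step" now consumes 4 input batches

for i, (x, y) in enumerate(dataloader):
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()      # gradients accumulate across batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()                 # this is what the it/s counter measures
        optimizer.zero_grad()
```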

Please note that issues are for bugs and feature requests only; general questions like this one should go on the forums, which is why I'm closing this.

keleog commented 3 years ago

Yeah, I am aware that each step takes more input batches, so it's normal to see fewer training steps per second. However, the actual total training time is much longer. Is this normal? Shouldn't it be faster than, or at least equal to, training with a gradient accumulation step of 1?

sgugger commented 3 years ago

You did not report the total training time. Since there are 4 (or 8) times fewer optimizer steps, the total time should stay roughly the same even if each step is slower.
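A back-of-the-envelope check using the numbers from this thread (the batch count is chosen purely for illustration):

```python
n_batches = 12_000  # hypothetical dataset size, for illustration only

# accumulation = 1: every batch is a step, at the reported ~6 it/s
t1 = n_batches / 6.0               # 2000 s

# accumulation = 4: 4x fewer steps; parity would need ~6/4 = 1.5 it/s
t4_ideal = (n_batches / 4) / 1.5   # 2000 s, same wall-clock time
t4_seen = (n_batches / 4) / 1.03   # ~2913 s, the reported slowdown

print(t1, t4_ideal, t4_seen)
```

So at the ideal rate of 1.5 it/s the total time would match; the reported 1.03 it/s is what makes the overall run noticeably longer.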