Closed keleog closed 3 years ago
@sgugger @LysandreJik
pls help
A reported step is a training step (with an optimizer pass). When you increase gradient accumulation, you take more input batches to do one step, so it's normal to have fewer training steps per second.
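To make the counting concrete, here is a minimal sketch of gradient accumulation in plain PyTorch (hypothetical `model`/`loader`/`optimizer` names; the `Trainer` handles this internally). One optimizer step is taken every `accum_steps` input batches, so the reported it/s counts optimizer steps, not batches:

```python
import torch

def train_epoch(model, loader, optimizer, accum_steps=4):
    """One epoch with gradient accumulation; returns the optimizer-step count."""
    model.train()
    optimizer.zero_grad()
    steps = 0
    for i, (inputs, labels) in enumerate(loader):
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        (loss / accum_steps).backward()  # scale so accumulated grads average out
        if (i + 1) % accum_steps == 0:
            optimizer.step()             # one reported "training step"
            optimizer.zero_grad()
            steps += 1
    return steps
```

With 8 batches and `accum_steps=4`, only 2 training steps are reported, which is why it/s drops even when samples/s stays roughly constant.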
Please note that the issues are for bugs and feature requests only, general questions like this one should go on the forums, which is why I'm closing this.
Yeah, I am aware that you take more input batches to do one step, so it's normal to have fewer training steps per second. However, the actual training time is much longer. Is this normal? Shouldn't it be faster, or at least equal to a gradient accumulation step of 1?
You did not report total training time. Since there are 4 (or 8) times fewer steps, it should stay the same even with a slower iterations-per-second total.
When training with a batch size of 32 (gradient accumulation step = 1), training speed is approximately 6 it/s. However, when I increase the gradient accumulation step to 4 or 8 (equivalent to batch sizes of 128 and 256), speed reduces to 1.03 it/s.
Is this expected behaviour?
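A quick sanity check on the numbers reported above (assuming it/s counts optimizer steps and a per-step batch of 32): if wall-clock throughput were unchanged, accumulation of 4 should run at about 6 / 4 = 1.5 steps/s, so the observed 1.03 steps/s is indeed a real slowdown in samples per second:

```python
micro_batch = 32                     # per-step batch size before accumulation

# samples/s = steps/s * micro_batch * accumulation
baseline = 6.0 * micro_batch * 1     # 192 samples/s at accumulation 1
accum4 = 1.03 * micro_batch * 4      # ~131.8 samples/s at accumulation 4

expected_steps_per_s = 6.0 / 4       # 1.5 steps/s if throughput were unchanged
```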
Environment info
- `transformers` version: 4.2.1

Who can help
Information
Model I am using (Bert, XLNet ...): XLMR
The problem arises when using:
The tasks I am working on is:
To reproduce
Steps to reproduce the behavior:
1.
2.
3.
Expected behavior