huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Increasing gradient accumulation steps significantly slows down training #10163

Closed: keleog closed this issue 3 years ago

keleog commented 3 years ago

When training with a batch size of 32 (gradient accumulation steps = 1), training speed is approximately 6 it/s. However, when I increase gradient accumulation steps to 4 or 8 (equivalent to batch sizes of 128 and 256), speed drops to 1.03 it/s.

Is this expected behaviour?
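For reference, a minimal sketch of the two configurations being compared. The actual training script isn't shown in this issue, so this assumes the run uses the `Trainer` API and the values are illustrative:

```python
from transformers import TrainingArguments

# Effective batch size 32: 32 per device, no accumulation (reported ~6 it/s).
args_no_accum = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
)

# Effective batch size 128: same per-device batch, 4 accumulation steps
# (reported ~1.03 it/s).
args_accum = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
)
```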

Environment info

Who can help

Information

Model I am using (Bert, XLNet ...): XLMR

The problem arises when using:

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

1.
2.
3.

Expected behavior

keleog commented 3 years ago

@sgugger @LysandreJik

Please help.

sgugger commented 3 years ago

A reported step is a training step (one with an optimizer pass). When you increase gradient accumulation, each step consumes more input batches, so it's normal to see fewer training steps per second.
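A minimal PyTorch-style sketch of this behaviour (illustrative only, not the Trainer's actual internals; the toy model and data are placeholders):

```python
import torch
from torch import nn

# Toy stand-ins for the real model, optimizer, and data.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataloader = [(torch.randn(32, 16), torch.randint(0, 2, (32,))) for _ in range(8)]
loss_fn = nn.CrossEntropyLoss()

accum_steps = 4  # one reported "step" now consumes 4 input batches

for i, (x, y) in enumerate(dataloader):
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()      # gradients accumulate across batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()                 # this is what the it/s counter measures
        optimizer.zero_grad()
```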

Please note that issues are for bugs and feature requests only; general questions like this one should go on the forums, which is why I'm closing this.

keleog commented 3 years ago

Yeah, I am aware that each step takes more input batches, so it's normal to see fewer training steps per second. However, the actual total training time is much longer. Is this normal? Shouldn't it be faster than, or at least equal to, training with a gradient accumulation step of 1?

sgugger commented 3 years ago

You did not report the total training time. Since there are 4 (or 8) times fewer optimizer steps, the total time should stay roughly the same even if each step is slower.
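A back-of-the-envelope check using the numbers from this thread (the batch count is chosen purely for illustration):

```python
n_batches = 12_000  # hypothetical dataset size, for illustration only

# accumulation = 1: every batch is a step, at the reported ~6 it/s
t1 = n_batches / 6.0               # 2000 s

# accumulation = 4: 4x fewer steps; parity would need ~6/4 = 1.5 it/s
t4_ideal = (n_batches / 4) / 1.5   # 2000 s, same wall-clock time
t4_seen = (n_batches / 4) / 1.03   # ~2913 s, the reported slowdown

print(t1, t4_ideal, t4_seen)
```

So at the ideal rate of 1.5 it/s the total time would match; the reported 1.03 it/s is what makes the overall run noticeably longer.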