huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Trainer reported loss is wrong when using DeepSpeed and gradient_accumulation_steps > 1 #11919

Closed rfernand2 closed 3 years ago

rfernand2 commented 3 years ago

Environment info

Who can help

@stas00, @sgugger (trainer.py)

See Also

https://github.com/microsoft/DeepSpeed/issues/1107

Information

Model I am using: RoBERTa

The problem arises when using:

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

  1. run scripts to pretrain a model with DeepSpeed on a single node with 1 GPU for N steps (gradient_accum_steps=1)
  2. run scripts to pretrain a model with DeepSpeed on a single node with 1 GPU for N steps (gradient_accum_steps=8)
  3. note the vast difference in the loss reported on the console by trainer.py

Expected behavior

The reported loss for any number of gradient_accum_steps, nodes, or GPUs should be the mean of all losses, i.e. the same order of magnitude as when training with gradient_accum_steps=1 on a single node with a single GPU.
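For concreteness, here is a minimal sketch of what "mean of all losses" means (plain PyTorch-style code with hypothetical names, assuming an HF-style model whose output exposes `.loss`): only the value passed to backward() is divided by the accumulation steps, while the value that gets reported is the plain mean over micro-batches.

```python
def accumulate_and_log(model, optimizer, micro_batches, gradient_accumulation_steps):
    """Hypothetical training loop: gradients are accumulated over GAS micro-batches,
    but the *reported* loss is the unscaled mean of the per-micro-batch losses."""
    running_loss = 0.0
    for step, batch in enumerate(micro_batches):
        loss = model(**batch).loss
        # Scale only the value used for backward(), not the value we report.
        (loss / gradient_accumulation_steps).backward()
        running_loss += loss.item()
        if (step + 1) % gradient_accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
    # Reported loss: mean over micro-batches, independent of GAS.
    return running_loss / len(micro_batches)
```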

tjruwase commented 3 years ago

Please note that the fix should involve ignoring the return value of deepspeed.backward() in this line. Or at least not updating loss with this return value since it is the scaled loss value, similar to scaled_loss in this line
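In Trainer terms, that suggestion would look roughly like the following simplified sketch of `training_step` (not the actual code; the real method also handles fp16/AMP and other branches):

```python
def training_step(self, model, inputs):
    model.train()
    inputs = self._prepare_inputs(inputs)
    loss = self.compute_loss(model, inputs)

    if self.deepspeed:
        # Run backward through the DeepSpeed engine, but do NOT overwrite
        # `loss` with its return value (the scaled loss) for reporting.
        self.deepspeed.backward(loss)
    else:
        if self.args.gradient_accumulation_steps > 1:
            loss = loss / self.args.gradient_accumulation_steps
        loss.backward()

    return loss.detach()
```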

sgugger commented 3 years ago

Aaaah! We had two different definitions of "scaled" here; I now fully understand the issue. I was thinking of scaled as scaled by the gradient accumulation steps factor, not as scaled by the loss scaling factor. This is an easy fix to add, will do that in a bit.
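To spell out the two meanings of "scaled" that were being conflated (illustrative numbers only):

```python
loss = 2.4                          # raw per-micro-batch loss
gradient_accumulation_steps = 8
fp16_loss_scale = 2 ** 16           # dynamic loss scaling factor for fp16

gas_scaled = loss / gradient_accumulation_steps   # 0.3  -- sums back to ~2.4 over 8 micro-steps, fine to report
fp16_scaled = loss * fp16_loss_scale              # 157286.4 -- must never leak into the reported loss
```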

stas00 commented 3 years ago

> Please note that the fix should involve ignoring the return value of deepspeed.backward() in this line. Or at least not updating loss with this return value since it is the scaled loss value, similar to scaled_loss in this line

@tjruwase, could you please review your suggestion? I see the DeepSpeed code scaling by GAS only. Please see:

https://github.com/microsoft/DeepSpeed/blob/c697d7ae1cf5a479a8a85afa3bf9443e7d54ac2b/deepspeed/runtime/engine.py#L1142-L1143

Am I missing something?

And running tests I don't see any problem with the current code.
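For reference, the linked lines amount to roughly the following (a paraphrased sketch with illustrative names, not the actual DeepSpeed source):

```python
def scale_loss_by_gas(loss, gradient_accumulation_steps):
    # deepspeed.backward() scales the loss it returns by 1/GAS only;
    # the fp16 loss-scale factor is applied internally and is never
    # exposed to the caller.
    if gradient_accumulation_steps > 1:
        loss = loss.float() / gradient_accumulation_steps
    return loss
```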

tjruwase commented 3 years ago

@stas00, you are right, my suggestion here is not correct. I initially thought the problem was the DeepSpeed code scaling the loss by GAS and exposing the scaled value to the client (HF). But based on your and @sgugger's findings, it seems there is nothing to do if HF is fine with deepspeed.backward() returning the GAS-scaled loss.

Sounds like this issue can be closed, once @rfernand2 agrees.

rfernand2 commented 3 years ago

Yes, sounds good to me.

stas00 commented 3 years ago

Closing, as the same report on the DeepSpeed side has been closed: https://github.com/microsoft/DeepSpeed/issues/1107