Aaaah! We had two different definitions of scaled here; I now fully understand the issue. I was thinking of scaled as scaled by the gradient accumulation steps factor, not as scaled by the loss scaling factor. This is an easy fix to add, will do that in a bit.
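Concretely, here is a minimal sketch of the two meanings of "scaled" that were being conflated (the values are made up for illustration; this is not library code):

```python
gradient_accumulation_steps = 4   # hypothetical value
loss_scale = 2.0 ** 16            # hypothetical fp16 loss-scaling factor

raw_loss = 2.5  # loss as returned by the model's forward pass

# Meaning 1: scaled by the gradient accumulation steps (GAS) factor,
# so that gradients from accumulated micro-batches average out.
gas_scaled_loss = raw_loss / gradient_accumulation_steps

# Meaning 2: scaled by the fp16 loss-scaling factor,
# so that small gradients don't underflow in half precision.
fp16_scaled_loss = raw_loss * loss_scale
```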
Please note that the fix should involve ignoring the return value of deepspeed.backward() in this line, or at least not updating loss with this return value, since it is the scaled loss value, similar to scaled_loss in this line.
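In code, the suggested change might look roughly like the following. This is a hedged sketch, not the actual Trainer code; training_step_sketch, model_engine, and inputs are illustrative names:

```python
def training_step_sketch(model_engine, inputs):
    loss = model_engine(**inputs).loss  # unscaled loss from the forward pass

    # Before: the GAS-scaled return value overwrote the loss used for logging:
    #   loss = model_engine.backward(loss)

    # After: run backward but ignore its (GAS-scaled) return value, so the
    # reported loss keeps the magnitude of the raw loss.
    model_engine.backward(loss)
    return loss.detach()
```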
@tjruwase, could you please review your suggestion? I see the DeepSpeed code doing scaling by GAS only. Please see:
Am I missing something?
And running the tests, I don't see any problem with the current code.
@stas00, you are right, my suggestion here is not correct. I initially thought that the DeepSpeed code scaling by GAS and exposing the scaled value to the client (HF) was the problem. But based on your and @sgugger's findings, it seems there is nothing to do if HF is fine with deepspeed.backward() returning the GAS-scaled loss.
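For context, the behavior in question boils down to something like the following sketch. This is not DeepSpeed's actual implementation; backward_sketch is an illustrative name, based only on the observation above that the scaling is by GAS:

```python
def backward_sketch(loss, gradient_accumulation_steps):
    scaled_loss = loss / gradient_accumulation_steps  # GAS scaling only
    scaled_loss.backward()  # runs backward on a real torch tensor
    return scaled_loss      # the caller (HF Trainer) receives this GAS-scaled value
```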
Sounds like this issue can be closed, once @rfernand2 agrees.
Yes, sounds good to me.
Closing, as the same report on the DeepSpeed side has been closed: https://github.com/microsoft/DeepSpeed/issues/1107
Environment info
transformers version: 4.7.0.dev0

Who can help
@stas00, @sgugger (trainer.py)
See Also
https://github.com/microsoft/DeepSpeed/issues/1107
Information
Model I am using: Roberta
The problem arises when using:
The task I am working on is:
To reproduce
Steps to reproduce the behavior:
Expected behavior
The reported loss, for any number of gradient_accum_steps, nodes, or GPUs, should be the mean of all losses, i.e. the same order of magnitude as when training with gradient_accum_steps=1 on a single node with a single GPU.
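A small sketch of this invariant, with made-up numbers: the reported loss should be the plain mean of the micro-batch losses, independent of gradient_accum_steps, whereas the bug shows up as the report shrinking by a factor of GAS:

```python
micro_batch_losses = [2.4, 2.6, 2.5, 2.3]
gradient_accum_steps = len(micro_batch_losses)

expected_report = sum(micro_batch_losses) / len(micro_batch_losses)  # ~2.45
buggy_report = expected_report / gradient_accum_steps                # ~0.61

print(expected_report, buggy_report)
# The bug manifests as the reported loss being gradient_accum_steps
# times smaller (here 4x) instead of staying near the raw mean.
```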