Closed ys950902 closed 3 months ago
@ys950902, can you please share a bit more details about why skipped_iter
is False in this case?
Hi @tjruwase, thanks for your reply. When running Megatron-DeepSpeed with DeepSpeed for 3D parallelism (https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/training.py#L674) or for ZeRO-2/3 (https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/training.py#L762), skipped_iter is set to 0 by default, and DeepSpeed never updates this flag, so it is always false here.
@ys950902, thanks for the explanation. I think the correct solution is to use the was_step_applied() API of DeepSpeed. I noticed that for the non-3D-parallelism case, it is already used to set update_successful:
https://github.com/microsoft/Megatron-DeepSpeed/blob/53b241f992f9b3dd7917bc36472f60cb118f8303/megatron/training.py#L746
The problem is that update_successful is not then used to appropriately set skipped_iter, unlike the non-DeepSpeed code path:
https://github.com/microsoft/Megatron-DeepSpeed/blob/53b241f992f9b3dd7917bc36472f60cb118f8303/megatron/training.py#L773-L778
Can you try setting update_successful and skipped_iter for both DeepSpeed cases in a fashion consistent with the Megatron case? Thanks.
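The suggested change can be sketched as follows. This is an illustrative snippet, not the actual Megatron-DeepSpeed code: the `FakeEngine` stub and the surrounding training-step shape are assumptions, while `was_step_applied()` is the DeepSpeed engine API referred to above.

```python
class FakeEngine:
    """Hypothetical stand-in for a DeepSpeed engine (assumption for illustration)."""

    def __init__(self, step_applied):
        self._step_applied = step_applied

    def was_step_applied(self):
        # In real DeepSpeed, this reports whether the last optimizer step
        # was actually applied (it is skipped on loss-scale overflow, etc.).
        return self._step_applied


def deepspeed_train_step(model_engine):
    # ... forward/backward and model_engine.step() would happen here ...
    # After stepping, query the engine instead of hard-coding skipped_iter = 0,
    # mirroring how the non-DeepSpeed Megatron path sets these values.
    update_successful = model_engine.was_step_applied()
    skipped_iter = 0 if update_successful else 1
    return update_successful, skipped_iter
```

With this shape, a skipped optimizer step would be counted in "number of skipped iterations" on both DeepSpeed code paths.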
Got it, I will fix it as you suggested!
Hi @tjruwase, could you please take a look at this PR, together with the corresponding DeepSpeed change to support bfloat16: https://github.com/microsoft/DeepSpeed/pull/5879.
Hi @tjruwase, will you merge this PR?
When running Megatron-DeepSpeed with DeepSpeed and hitting a NaN issue, the only way to tell (as shown below) is that no lm loss is printed, while the number of nan iterations stays at 0, which is not correct:
iteration 9/ 10 | consumed samples: 108 | consumed tokens: 442368 | elapsed time per iteration (ms): 1979.2 | learning rate: 4.219E-07 | global batch size: 12 | loss scale: 1.0 | actual seqlen: 4096 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 6.063 | tokens per gpu per second (tgs): 2069.506 | TFLOPs: 127.00 |
This PR fixes that issue: the NaN check should be performed regardless of whether the iteration was skipped.
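A minimal sketch of the intended behavior, assuming a hypothetical helper (the function name and counter layout are illustrative, not the actual PR code): the NaN check runs on the reported losses even when the optimizer step was skipped.

```python
import math


def count_nan_iters(loss_values, nan_iters_so_far):
    """Hypothetical helper: bump the nan-iteration counter unconditionally.

    The bug was that the NaN check was tied to the (always-zero) skipped_iter
    flag on the DeepSpeed paths, so NaN losses were never counted.
    """
    got_nan = any(
        v is not None and (math.isnan(v) or math.isinf(v))
        for v in loss_values
    )
    # Check for NaN/Inf even on skipped iterations.
    return nan_iters_so_far + int(got_nan)
```

This way the "number of nan iterations" field in the training log reflects NaN losses instead of staying at 0.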