microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

[NaN] Fix nan print issue when running Megatron-Deepspeed with DeepSpeed #434

Closed ys950902 closed 3 months ago

ys950902 commented 3 months ago

When running Megatron-DeepSpeed with DeepSpeed, we hit a NaN issue, but the only way to notice it is that no lm loss is printed and the number of nan iterations stays at 0, which is not correct:

iteration 9/ 10 | consumed samples: 108 | consumed tokens: 442368 | elapsed time per iteration (ms): 1979.2 | learning rate: 4.219E-07 | global batch size: 12 | loss scale: 1.0 | actual seqlen: 4096 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 6.063 | tokens per gpu per second (tgs): 2069.506 | TFLOPs: 127.00 |

This PR fixes the issue: we should do the NaN check regardless of whether the iteration was skipped.
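A minimal sketch of the proposed behavior (names are illustrative; the real logic lives in `training_log()` in megatron/training.py): check every iteration's losses for NaN/Inf instead of only checking skipped iterations, so the nan-iteration counter moves even when DeepSpeed leaves `skipped_iter` at 0.

```python
import math

def update_nan_count(loss_dict, nan_iters):
    # Hypothetical sketch of the fix: scan all reduced losses for
    # NaN/Inf on every iteration, not just on skipped ones.
    got_nan = False
    for key, value in loss_dict.items():
        v = float(value)
        if math.isnan(v) or math.isinf(v):
            got_nan = True
    return nan_iters + int(got_nan)

# A NaN lm loss now increments the counter even though the
# iteration was not flagged as skipped.
print(update_nan_count({'lm loss': float('nan')}, nan_iters=0))  # 1
print(update_nan_count({'lm loss': 2.5}, nan_iters=0))           # 0
```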

tjruwase commented 3 months ago

@ys950902, can you please share a bit more details about why skipped_iter is False in this case?

ys950902 commented 3 months ago

> @ys950902, can you please share a bit more details about why skipped_iter is False in this case?

Hi @tjruwase, thanks for your reply. When running Megatron-DeepSpeed with DeepSpeed for 3D parallelism (https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/training.py#L674) or for ZeRO stage 2/3 (https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/training.py#L762), skipped_iter is set to 0 by default, and DeepSpeed never updates this flag, so it is always false here.

tjruwase commented 3 months ago

@ys950902, thanks for the explanation. I think the correct solution is to use the was_step_applied() API of DeepSpeed. And I noticed that for the non-3D parallelism case, it is already used to set update_successful. https://github.com/microsoft/Megatron-DeepSpeed/blob/53b241f992f9b3dd7917bc36472f60cb118f8303/megatron/training.py#L746

The problem is that update_successful is not used to appropriately set skipped_iter unlike the non-deepspeed code path. https://github.com/microsoft/Megatron-DeepSpeed/blob/53b241f992f9b3dd7917bc36472f60cb118f8303/megatron/training.py#L773-L778

Can you try setting update_successful and skipped_iter for both DeepSpeed cases in a fashion consistent with the Megatron case? Thanks
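The suggestion above can be sketched as follows (hypothetical names; `EngineStub` stands in for a real DeepSpeed engine, whose `was_step_applied()` returns False when the optimizer step was skipped, e.g. on a loss-scale overflow):

```python
class EngineStub:
    """Stand-in for a DeepSpeed engine whose optimizer step overflowed."""
    def was_step_applied(self) -> bool:
        # Real DeepSpeed returns False when the step was skipped.
        return False

def finish_train_step(model_engine):
    # Consistent with the Megatron (non-DeepSpeed) code path:
    # skipped_iter is 1 exactly when the update was not applied.
    update_successful = model_engine.was_step_applied()
    skipped_iter = 0 if update_successful else 1
    return update_successful, skipped_iter

update_successful, skipped_iter = finish_train_step(EngineStub())
print(update_successful, skipped_iter)  # False 1
```

With skipped_iter derived this way for both DeepSpeed paths, the existing NaN accounting in the logging code sees the skipped iteration just as it does on the non-DeepSpeed path.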

ys950902 commented 3 months ago

> @ys950902, thanks for the explanation. I think the correct solution is to use the was_step_applied() API of DeepSpeed. And I noticed that for the non-3D parallelism case, it is already used to set update_successful.
>
> https://github.com/microsoft/Megatron-DeepSpeed/blob/53b241f992f9b3dd7917bc36472f60cb118f8303/megatron/training.py#L746
>
> The problem is that update_successful is not used to appropriately set skipped_iter unlike the non-deepspeed code path.
>
> https://github.com/microsoft/Megatron-DeepSpeed/blob/53b241f992f9b3dd7917bc36472f60cb118f8303/megatron/training.py#L773-L778
>
> Can you try setting update_successful and skipped_iter for both deepspeed cases in a consistent fashion to the megatron case? Thanks

Got it, I will fix it as you suggested!

ys950902 commented 3 months ago

Hi @tjruwase, could you please take a look at this PR, along with the corresponding DeepSpeed change to support bfloat16: https://github.com/microsoft/DeepSpeed/pull/5879.

ys950902 commented 3 months ago

Hi @tjruwase, will you merge this PR?