microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

[NaN] Fix nan print issue when running Megatron-Deepspeed with DeepSpeed #434

Closed ys950902 closed 3 months ago

ys950902 commented 3 months ago

When running Megatron-DeepSpeed with DeepSpeed, we hit a NaN issue, but the only way to notice it is that no lm loss is printed and the number of nan iterations stays at 0, which is not correct:

iteration 9/ 10 | consumed samples: 108 | consumed tokens: 442368 | elapsed time per iteration (ms): 1979.2 | learning rate: 4.219E-07 | global batch size: 12 | loss scale: 1.0 | actual seqlen: 4096 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 6.063 | tokens per gpu per second (tgs): 2069.506 | TFLOPs: 127.00 |

This PR fixes the issue: we should do the NaN check regardless of whether the iteration was skipped.
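A minimal sketch of the proposed behavior (names are illustrative; the real logic lives in `training_log()` in megatron/training.py): check every iteration's losses for NaN/Inf instead of only checking skipped iterations, so the nan-iteration counter moves even when DeepSpeed leaves `skipped_iter` at 0.

```python
import math

def update_nan_count(loss_dict, nan_iters):
    # Hypothetical sketch of the fix: scan all reduced losses for
    # NaN/Inf on every iteration, not just on skipped ones.
    got_nan = False
    for key, value in loss_dict.items():
        v = float(value)
        if math.isnan(v) or math.isinf(v):
            got_nan = True
    return nan_iters + int(got_nan)

# A NaN lm loss now increments the counter even though the
# iteration was not flagged as skipped.
print(update_nan_count({'lm loss': float('nan')}, nan_iters=0))  # 1
print(update_nan_count({'lm loss': 2.5}, nan_iters=0))           # 0
```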

tjruwase commented 3 months ago

@ys950902, can you please share a bit more details about why skipped_iter is False in this case?

ys950902 commented 3 months ago

> @ys950902, can you please share a bit more details about why skipped_iter is False in this case?

Hi @tjruwase, thanks for your reply. When running Megatron-DeepSpeed with DeepSpeed for 3D parallelism (https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/training.py#L674) or for ZeRO stage 2/3 (https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/training.py#L762), skipped_iter is set to 0 by default, and DeepSpeed never updates this flag, so it is always false here.

tjruwase commented 3 months ago

@ys950902, thanks for the explanation. I think the correct solution is to use the was_step_applied() API of DeepSpeed. And I noticed that for the non-3D parallelism case, it is already used to set update_successful. https://github.com/microsoft/Megatron-DeepSpeed/blob/53b241f992f9b3dd7917bc36472f60cb118f8303/megatron/training.py#L746

The problem is that update_successful is not used to appropriately set skipped_iter unlike the non-deepspeed code path. https://github.com/microsoft/Megatron-DeepSpeed/blob/53b241f992f9b3dd7917bc36472f60cb118f8303/megatron/training.py#L773-L778

Can you try setting update_successful and skipped_iter for both DeepSpeed cases in a fashion consistent with the Megatron case? Thanks
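The suggestion above can be sketched as follows (hypothetical names; `EngineStub` stands in for a real DeepSpeed engine, whose `was_step_applied()` returns False when the optimizer step was skipped, e.g. on a loss-scale overflow):

```python
class EngineStub:
    """Stand-in for a DeepSpeed engine whose optimizer step overflowed."""
    def was_step_applied(self) -> bool:
        # Real DeepSpeed returns False when the step was skipped.
        return False

def finish_train_step(model_engine):
    # Consistent with the Megatron (non-DeepSpeed) code path:
    # skipped_iter is 1 exactly when the update was not applied.
    update_successful = model_engine.was_step_applied()
    skipped_iter = 0 if update_successful else 1
    return update_successful, skipped_iter

update_successful, skipped_iter = finish_train_step(EngineStub())
print(update_successful, skipped_iter)  # False 1
```

With skipped_iter derived this way for both DeepSpeed paths, the existing NaN accounting in the logging code sees the skipped iteration just as it does on the non-DeepSpeed path.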

ys950902 commented 3 months ago

> @ys950902, thanks for the explanation. I think the correct solution is to use the was_step_applied() API of DeepSpeed. And I noticed that for the non-3D parallelism case, it is already used to set update_successful.
>
> https://github.com/microsoft/Megatron-DeepSpeed/blob/53b241f992f9b3dd7917bc36472f60cb118f8303/megatron/training.py#L746
>
> The problem is that update_successful is not used to appropriately set skipped_iter unlike the non-deepspeed code path.
>
> https://github.com/microsoft/Megatron-DeepSpeed/blob/53b241f992f9b3dd7917bc36472f60cb118f8303/megatron/training.py#L773-L778
>
> Can you try setting update_successful and skipped_iter for both deepspeed cases in a consistent fashion to the megatron case? Thanks

Got it, I will fix it as you suggested!

ys950902 commented 3 months ago

Hi @tjruwase, could you please take a look at this PR, along with the corresponding DeepSpeed change to support bfloat16: https://github.com/microsoft/DeepSpeed/pull/5879.

ys950902 commented 3 months ago

Hi @tjruwase, will you merge this PR?