microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

fix NAN loss of rope long context training #399

Open inkcherry opened 5 months ago

inkcherry commented 5 months ago

Based on https://github.com/microsoft/Megatron-DeepSpeed/pull/392, we got NaN loss during long-context training with DeepSpeed sequence parallelism (Ulysses) for a Llama-style model.

We found that this issue is caused by a precision problem: computing the RoPE position representation in half precision leads to accuracy loss at long context lengths. A similar change has also been made in Hugging Face Transformers:
https://github.com/huggingface/transformers/blob/63fb253df0d976b95d9b4b9a7b0012e5f8a37896/src/transformers/models/llama/modeling_llama.py#L111
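As a minimal sketch of the idea (not the exact patch in this PR, and the function name and signature here are illustrative), the fix amounts to building the RoPE frequency and angle tables in float32 and only casting the final cos/sin tables back to the training dtype:

```python
import torch

def build_rope_cache(seq_len, head_dim, base=10000.0, device=None, dtype=torch.bfloat16):
    # Compute inverse frequencies and position angles in float32 to avoid the
    # precision loss that shows up as NaN loss at long sequence lengths.
    inv_freq = 1.0 / (
        base ** (torch.arange(0, head_dim, 2, device=device, dtype=torch.float32) / head_dim)
    )
    positions = torch.arange(seq_len, device=device, dtype=torch.float32)
    angles = torch.outer(positions, inv_freq)   # [seq_len, head_dim // 2], fp32
    emb = torch.cat((angles, angles), dim=-1)   # [seq_len, head_dim], fp32
    # Only the final cos/sin tables are cast back to the training dtype.
    return emb.cos().to(dtype), emb.sin().to(dtype)
```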

shrutiramesh1988 commented 5 months ago

Even with this fix, I'm still facing loss=nan issues when trying to run Llama 2 pre-training on single/multiple nodes with BF16, ZeRO stage 1, --use-rotary-position-embeddings, and a sequence length of 4096. Could you kindly help?