inkcherry opened 5 months ago
Even with this fix, I'm still facing loss=nan issues when trying to run LLaMA-2 pre-training on single/multiple nodes with BF16, ZeRO stage 1, --use-rotary-position-embeddings, and a sequence length of 4096. Could you kindly help?
Based on https://github.com/microsoft/Megatron-DeepSpeed/pull/392, we got NaN loss during long-context training with DeepSpeed sequence parallelism (Ulysses) on a LLaMA-style model.
We found that this issue is caused by a precision problem: computing the RoPE position/frequency representation in half precision loses accuracy at long sequence lengths, which produces NaN loss. A similar modification has already been applied in Hugging Face Transformers (see below for a minimal sketch of the idea).
https://github.com/huggingface/transformers/blob/63fb253df0d976b95d9b4b9a7b0012e5f8a37896/src/transformers/models/llama/modeling_llama.py#L111
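A minimal sketch of the fix, assuming a standard RoPE cache computation (the function name and signature here are illustrative, not the actual Megatron-DeepSpeed code): keep the inverse frequencies, position indices, and cos/sin tables in float32, and only cast down to the training dtype at the end.

```python
import torch

def build_rope_cache(seq_len: int, dim: int, base: float = 10000.0,
                     device=None, out_dtype=torch.bfloat16):
    # Compute inverse frequencies and position angles in float32.
    # Doing this directly in bf16/fp16 loses precision for large position
    # indices, which is what triggers NaN loss at long context lengths.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device,
                                            dtype=torch.float32) / dim))
    positions = torch.arange(seq_len, device=device, dtype=torch.float32)
    freqs = torch.outer(positions, inv_freq)   # [seq_len, dim // 2], fp32
    emb = torch.cat((freqs, freqs), dim=-1)    # [seq_len, dim], fp32
    # Cast to the training dtype only after cos/sin are taken in full precision.
    return emb.cos().to(out_dtype), emb.sin().to(out_dtype)
```

The key design choice (mirroring the linked Transformers change) is that the half-precision cast happens after the trigonometric functions are evaluated, so long-position angles are never represented in bf16/fp16.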