huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

fine tuning the updated Phi-2 with flash-attn-2 produces very high loss > 2 #28488

Closed: abacaj closed this issue 5 months ago

abacaj commented 10 months ago

System Info

The updated Phi-2 code produces a high loss. I have tried fp16, bf16, DeepSpeed, and FSDP; the result is the same: the loss starts at 2 and keeps going higher. Setting use_flash_attention_2=False, or using the old Phi-2 modeling file, fixes this (see the loading sketch after the version list below).

torch==2.1.2
flash-attn==2.4.2
transformers==4.37.0.dev0
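
A minimal sketch of the two loading paths described above, assuming the "microsoft/phi-2" checkpoint and a CUDA device. The attn_implementation argument is the transformers 4.36+ equivalent of use_flash_attention_2; the exact loading code used in the report is not shown in the issue.

```python
import torch
from transformers import AutoModelForCausalLM

# Configuration that reproduces the high loss: flash-attn-2 with the updated
# in-library Phi-2 modeling code.
model_fa2 = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # equivalent to use_flash_attention_2=True
)

# Workaround: fall back to the eager attention implementation
# (equivalent to use_flash_attention_2=False).
model_eager = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
)
```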

Who can help?

No response

Reproduction

Fine-tune the updated Phi-2 model, with flash-attn-2 enabled, using the transformers Trainer (a sketch follows).
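
A minimal fine-tuning sketch of that setup, assuming the "microsoft/phi-2" checkpoint, bf16 training, and a small stand-in dataset ("wikitext"); the actual training script, dataset, and hyperparameters from the report are not included in the issue.

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load the updated in-library Phi-2 with flash-attn-2, the setting that
# triggers the reported loss behaviour.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Small stand-in corpus, tokenized for causal LM training.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="phi2-ft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        bf16=True,
        num_train_epochs=1,
        logging_steps=10,
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Watch the logged loss: per the report it starts at 2 and climbs with
# flash-attn-2, but decreases with attn_implementation="eager".
trainer.train()
```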

Expected behavior

Loss goes down during fine-tuning.