Closed: ZhaiFeiyue closed this issue 1 year ago
@ankurhabana will follow this issue
@regisss Please let us know if the fix will be available in 1.7.5. We need it for the Synapse 1.12.0 release. Thanks!
@asharmahabana I don't have an ETA for this fix, I currently don't have the bandwidth to work on it before next week.
Okay, so quickly looking at the changes between v1.6.1 and v1.7, I suspect that this issue comes from using FusedRoPE during training.
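For context, a minimal sketch of the reference (non-fused) rotary position embedding, i.e. the operation a FusedRoPE kernel computes in a single fused op. This is illustrative only and assumes the standard RoPE formulation; the function names here are hypothetical, not the actual optimum-habana implementation.

```python
import numpy as np

def rope_frequencies(head_dim, seq_len, base=10000.0):
    # Per-dimension inverse frequencies, one angle per (position, dim-pair)
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    t = np.arange(seq_len)
    freqs = np.outer(t, inv_freq)                  # (seq_len, head_dim/2)
    emb = np.concatenate([freqs, freqs], axis=-1)  # (seq_len, head_dim)
    return np.cos(emb), np.sin(emb)

def rotate_half(x):
    # Swap the two halves of the last dim, negating the second half
    half = x.shape[-1] // 2
    return np.concatenate([-x[..., half:], x[..., :half]], axis=-1)

def apply_rope(x, cos, sin):
    # x: (seq_len, head_dim); rotate each position by its angle
    return x * cos + rotate_half(x) * sin

seq_len, head_dim = 4, 8
cos, sin = rope_frequencies(head_dim, seq_len)
q = np.ones((seq_len, head_dim))
q_rot = apply_rope(q, cos, sin)
# Position 0 has angle 0, so its vector is unchanged;
# RoPE is a pure rotation, so per-position norms are preserved.
```

A fused kernel produces the same values but performs the cos/sin lookup and rotation in one device op, which is why disabling it is numerically equivalent yet slower.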
@ZhaiFeiyue @asharmahabana Could you try #410 and let me know if that works on your side? You'll need a batch size of 1 if you use a single Gaudi2 node.
We probably need the same fix for Llama, I'll add it in this PR once it's approved.
@regisss @ZhaiFeiyue, our customer tried Llama 2 fine-tuning without any problem. Disabling fused RoPE causes a performance drop that our customer is currently benchmarking. @schoi-habana, please help them debug what went wrong.
@mandy-li Okay, I'll enable it again then. Was this Llama fine-tuning done with DeepSpeed?
@regisss yes the exact same command worked for Llama fine-tuning with DeepSpeed
@regisss, the customer used LoRA fine-tuning for Llama 2 and didn't hit any problem. @schoi-habana is debugging to see whether DeepSpeed caused the issue.
I just opened #413 to revert these changes for Llama and Falcon.
System Info
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
crash log
Expected behavior
same with 1.6.1