huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate

Accelerate 0.31.0 gradient accumulation bug. #2866

Open · nikitabalabin opened this issue 2 weeks ago

nikitabalabin commented 2 weeks ago

System Info

I updated from accelerate 0.30.0 to 0.31.0, and all of my training runs with gradient_accumulation_steps > 1 started to collapse. Please double-check that everything is OK.

Reproduction

mixed_precision = 'fp16'
gradient_accumulation_steps > 1

Expected behavior

Training should be stable with both gradient_accumulation_steps = 1 and gradient_accumulation_steps > 1.

BenjaminBossan commented 2 weeks ago

> all my trainings with gradient_accumulation_steps > 1 started to collapse.

Could you please provide more details? What does "collapse" mean?

Moreover, could you share the output of accelerate env and, if possible, the code to reproduce the failing training?