nikitabalabin opened 2 weeks ago
All my training runs with gradient_accumulation_steps > 1 have started to collapse.
Could you please provide more details? What does "collapse" mean here?
Also, could you share the output of `accelerate env`
and, if possible, code that reproduces the failing training?
System Info
Reproduction
mixed_precision='fp16'
gradient_accumulation_steps > 1
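Since no reproduction script was attached, here is a minimal, library-free sketch (hypothetical, not the reporter's actual code) of the invariant that gradient accumulation is supposed to preserve: the accumulated gradient, with each micro-batch gradient scaled by 1/steps, should equal the full-batch gradient. A check like this is one way to make "collapse" concrete when filing the repro.

```python
# Hypothetical sketch: for a linear model y_hat = w * x with mean squared
# error, compare the full-batch gradient against the gradient accumulated
# over micro-batches. The names `grad`, `xs`, `ys` are illustrative only.

def grad(w, xs, ys):
    # d/dw of mean((w * x - y)^2) over the given batch
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = grad(w, xs, ys)  # gradient over the whole batch at once

# Accumulate over 2 micro-batches of size 2, dividing each micro-batch
# gradient by the number of accumulation steps. (In fp16 training the
# same 1/steps scaling must be applied consistently with loss scaling,
# or the accumulated gradients can overflow or be mis-scaled.)
steps = 2
acc = 0.0
for i in range(0, len(xs), 2):
    acc += grad(w, xs[i : i + 2], ys[i : i + 2]) / steps

print(abs(full - acc) < 1e-12)  # → True
```

If the analogous check fails in the real training loop (e.g. loss curves diverge only when gradient_accumulation_steps > 1), that comparison is exactly the kind of detail that would help pin down the bug.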
Expected behavior
Training should be stable with both gradient_accumulation_steps = 1 and gradient_accumulation_steps > 1.