Closed: mathephysicist closed this issue 3 months ago
cc @muellerzr as it seems to cover trainer + TPU
Thanks for the flag @mathephysicist! Can you confirm this works (with no change in Trainer) if installing accelerate from the grad-accum-tpu branch via pip install git+https://github.com/huggingface/accelerate@grad-accum-tpu?
Will try that out!
That seems to uninstall/remove a lot of the Neuron packages, resulting in xla_model-related issues. Could that be due to environment issues? I am trying the optimum-neuron tag v0.0.18; do you think trying optimum-neuron master would resolve them?
RuntimeError: Cannot replicate if number of devices (1) is different from 32
This is because optimum is pinned to a much older version of accelerate, sadly. We'll need to put the fix here in transformers, it looks like... not ideal... (though the same solution has been put into accelerate)
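For anyone checking their own environment, a quick way to confirm which versions actually got installed (a generic snippet, not something from this thread):

```python
# Print installed versions of the relevant packages to see whether
# optimum-neuron has pulled in an older accelerate.
import importlib.metadata as md

for pkg in ("transformers", "accelerate", "optimum-neuron"):
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg} is not installed")
```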
@mathephysicist if you want to open a PR with your one-liner, I think that'd be fine with me.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
If I use Optimum Neuron on Trainium with --gradient_accumulation_steps > 1, training fails.
I then modified the line at https://github.com/huggingface/transformers/blob/6d1f545665ac66420af9f6702d891a30c5d070ea/src/transformers/trainer.py#L1966C21-L1966C23
to include a one-line fix, set gradient_accumulation_steps > 1 again, and it worked.
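The one-liner itself isn't quoted above, so purely as a sketch for context: the modified line sits in the Trainer's accumulation-boundary check, whose logic is roughly the following (the standalone function name and signature here are hypothetical, and this is not the author's actual patch):

```python
# Hypothetical standalone rendering of trainer.py's boundary check; the
# function name and arguments are assumptions, not the reported one-liner.
def should_run_optimizer_step(total_batched_samples: int,
                              gradient_accumulation_steps: int,
                              is_last_step_in_epoch: bool) -> bool:
    """True when the Trainer should sync gradients and step the optimizer."""
    at_boundary = total_batched_samples % gradient_accumulation_steps == 0
    # The epoch can end mid-accumulation window; step anyway so the last
    # partial window's gradients are not dropped.
    return at_boundary or is_last_step_in_epoch
```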
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Take any working Neuron script that uses the Hugging Face Trainer, and set --gradient_accumulation_steps 2.
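As a concrete example, a run of this shape (the model and dataset wiring are placeholders, not from the report) should hit the failure once accumulation is above 1:

```python
# Minimal reproduction sketch: any working Neuron/Trainer setup, with
# gradient accumulation turned on.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,  # the failing setting; 1 works
)
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```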
Expected behavior
Gradient accumulation should work.
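For reference, the expected semantics are the standard accumulation pattern: scale each micro-batch loss by the accumulation factor and only step the optimizer every N micro-batches. A plain-PyTorch sketch of that behavior (illustrative only, not Trainer internals):

```python
# Reference semantics of gradient accumulation with accum_steps = 2:
# two backward passes contribute to one optimizer step.
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 2

for step in range(8):
    x, y = torch.randn(3, 4), torch.randn(3, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()    # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:  # boundary: apply and clear
        optimizer.step()
        optimizer.zero_grad()
```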