Closed: mathephysicist closed this issue 3 months ago
cc @muellerzr as it seems to cover trainer + TPU
Thanks for the flag @mathephysicist! Can you confirm this works (with no change in Trainer) if installing accelerate from the grad-accum-tpu branch via pip install git+https://github.com/huggingface/accelerate@grad-accum-tpu?
Will try that out!
That seems to uninstall/remove a lot of the Neuron packages, resulting in xla_model-related issues. Could that be due to environment issues? I am trying the optimum-neuron tag v0.0.18; do you think trying optimum-neuron master would resolve them?
RuntimeError: Cannot replicate if number of devices (1) is different from 32
This is because optimum is pinned to a much older version of accelerate, sadly. We'll need to put the fix here in transformers, it looks like... not ideal... (though the same solution has been put into accelerate)
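For anyone checking their own environment, a quick way to confirm which versions actually got installed (a generic snippet, not something from this thread):

```python
# Print installed versions of the relevant packages to see whether
# optimum-neuron has pulled in an older accelerate.
import importlib.metadata as md

for pkg in ("transformers", "accelerate", "optimum-neuron"):
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg} is not installed")
```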
@mathephysicist if you want to open a PR with your one-liner, I think that'd be fine with me.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
If I use Optimum Neuron on Trainium with --gradient_accumulation_steps > 1, training fails.
I then modified the line at https://github.com/huggingface/transformers/blob/6d1f545665ac66420af9f6702d891a30c5d070ea/src/transformers/trainer.py#L1966C21-L1966C23
to include a one-line fix, set gradient_accumulation_steps > 1 again, and it worked.
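The one-liner itself isn't quoted above, so purely as a sketch for context: the modified line sits in the Trainer's accumulation-boundary check, whose logic is roughly the following (the standalone function name and signature here are hypothetical, and this is not the author's actual patch):

```python
# Hypothetical standalone rendering of trainer.py's boundary check; the
# function name and arguments are assumptions, not the reported one-liner.
def should_run_optimizer_step(total_batched_samples: int,
                              gradient_accumulation_steps: int,
                              is_last_step_in_epoch: bool) -> bool:
    """True when the Trainer should sync gradients and step the optimizer."""
    at_boundary = total_batched_samples % gradient_accumulation_steps == 0
    # The epoch can end mid-accumulation window; step anyway so the last
    # partial window's gradients are not dropped.
    return at_boundary or is_last_step_in_epoch
```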
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Take any working Neuron script that uses the Hugging Face Trainer, and set --gradient_accumulation_steps 2.
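As a concrete example, a run of this shape (the model and dataset wiring are placeholders, not from the report) should hit the failure once accumulation is above 1:

```python
# Minimal reproduction sketch: any working Neuron/Trainer setup, with
# gradient accumulation turned on.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,  # the failing setting; 1 works
)
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```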
Expected behavior
Gradient accumulation should work.
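For reference, the expected semantics are the standard accumulation pattern: scale each micro-batch loss by the accumulation factor and only step the optimizer every N micro-batches. A plain-PyTorch sketch of that behavior (illustrative only, not Trainer internals):

```python
# Reference semantics of gradient accumulation with accum_steps = 2:
# two backward passes contribute to one optimizer step.
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 2

for step in range(8):
    x, y = torch.randn(3, 4), torch.randn(3, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()    # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:  # boundary: apply and clear
        optimizer.step()
        optimizer.zero_grad()
```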