More than 1 Trainium Instance

mathephysicist commented 4 months ago

System Info

I couldn't train on more than one Trainium instance with Optimum Neuron. However, if I comment out the code related to the Neuron cache, training seems to work.

I commented out

https://github.com/huggingface/optimum-neuron/blob/ee0c1f4104ee817daf84107776d9a2d7b92499dd/optimum/neuron/trainers.py#L132-L147

then set the path to the cache_dir obtained from the get method, and then commented out

https://github.com/huggingface/optimum-neuron/blob/ee0c1f4104ee817daf84107776d9a2d7b92499dd/optimum/neuron/trainers.py#L200C9-L208C36

With these changes, training works on multiple nodes.

Who can help?

No response

Information

[x] The official example scripts
[x] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[x] My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

Run mlm.py from the examples on more than one Trainium node; the run fails.
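
For readers who want to try a multi-node run, here is a minimal sketch of how the per-node launch command could be assembled, assuming torchrun is the launcher. The issue does not state the launcher, node count, process count, or rendezvous address, so `build_torchrun_command` and every value below are hypothetical placeholders, and the arguments passed to mlm.py are left out.

```python
import shlex


def build_torchrun_command(script, nnodes, node_rank, nproc_per_node,
                           master_addr, master_port=29500, script_args=()):
    """Return the torchrun argv for one node of a multi-node job.

    Hypothetical helper: all values are placeholders, not the settings
    used in the original report.
    """
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")
    return [
        "torchrun",
        f"--nnodes={nnodes}",                  # total number of instances
        f"--node_rank={node_rank}",            # index of this instance
        f"--nproc_per_node={nproc_per_node}",  # worker processes per instance
        f"--master_addr={master_addr}",        # address of the rank-0 instance
        f"--master_port={master_port}",
        script,
        *script_args,
    ]


if __name__ == "__main__":
    # Print the command each of two nodes would run (placeholder values).
    for rank in range(2):
        cmd = build_torchrun_command(
            "mlm.py", nnodes=2, node_rank=rank,
            nproc_per_node=32, master_addr="10.0.0.1",
        )
        print(shlex.join(cmd))
```

The same command is run on every node, with only the node rank changing.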

Expected behavior

MLM training should run on more than one Trainium node.

michaelbenayoun commented 4 months ago

Hi, it is under development in #440 and should be fixed soon.

mathephysicist commented 4 months ago

Thanks for the update, @michaelbenayoun. Is there an ETA for this feature, or any way I can help it ship faster?

philschmid commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!
