huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

More than 1 Trainium Instance #485

Open mathephysicist opened 4 months ago

mathephysicist commented 4 months ago

System Info

I couldn't train on more than one Trainium instance with Optimum Neuron. However, if I comment out the code related to the neuroncache, training seems to work.

I commented out
https://github.com/huggingface/optimum-neuron/blob/ee0c1f4104ee817daf84107776d9a2d7b92499dd/optimum/neuron/trainers.py#L132-L147

then set the path to the cache_dir from the get method, and then commented out

https://github.com/huggingface/optimum-neuron/blob/ee0c1f4104ee817daf84107776d9a2d7b92499dd/optimum/neuron/trainers.py#L200C9-L208C36

After these changes, training worked on multiple nodes.

Who can help?

No response

Reproduction (minimal, reproducible, runnable)

Run mlm.py from the examples on more than one Trainium node. Training fails.
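For anyone trying to reproduce this, here is a minimal sketch of how the per-node launch command could be assembled. It assumes the job is started with torchrun, one invocation per node; the issue does not say which launcher was used. The helper name `build_torchrun_command`, the address `10.0.0.1`, the port, and the value of 32 processes per node are illustrative assumptions, not details from the report.

```python
import shlex


def build_torchrun_command(script, node_rank, nnodes, nproc_per_node,
                           master_addr, master_port=29500, script_args=()):
    """Return the torchrun argv for one node of a multi-node job.

    Hypothetical helper for illustration only. It just formats standard
    torchrun flags and does not launch anything.
    """
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")
    return [
        "torchrun",
        f"--nnodes={nnodes}",                  # total number of nodes
        f"--node_rank={node_rank}",            # index of this node
        f"--nproc_per_node={nproc_per_node}",  # worker processes on this node
        f"--master_addr={master_addr}",        # rendezvous host (node 0)
        f"--master_port={master_port}",
        script,
        *script_args,
    ]


if __name__ == "__main__":
    # Assumed setup: 2 nodes, 32 workers each, node 0 reachable at 10.0.0.1.
    for rank in range(2):
        cmd = build_torchrun_command("mlm.py", rank, 2, 32, "10.0.0.1")
        print(shlex.join(cmd))
```

Each printed line would be run on the node with the matching rank. Script arguments such as the model name and dataset are left out because the issue does not list them.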

Expected behavior

MLM training should run on multiple Trainium nodes.

michaelbenayoun commented 4 months ago

Hi, it is under development in #440 and should be fixed soon.

mathephysicist commented 4 months ago

Thanks for the update, @michaelbenayoun. Is there an ETA for this feature, or is there any way I can help it ship faster?

philschmid commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!