philschmid / aws-neuron-samples

MIT License
Unable to use Neuron Cores while fine-tuning BERT on Trainium #3

Closed by DhruvaBansal00 1 year ago

DhruvaBansal00 commented 1 year ago

Hey!

I am trying to follow this guide: https://huggingface.co/docs/optimum-neuron/tutorials/fine_tune_bert to fine-tune BERT on a trn1.2xlarge instance. I set up the datasets as described in the blog and then ran the training script, but Neuron core usage stays at 0%. This matters to me because the expected training time is close to 5 hours.

[Two screenshots attached, taken 2023-06-28 at 1:10:32 PM and 1:09:34 PM]

cc: @philschmid

philschmid commented 1 year ago

Thank you for reporting. When did you create your environment? It seems that there is an error with the new AMI. Can you use the previous one?

DhruvaBansal00 commented 1 year ago

I created the environment yesterday, using this AMI: huggingface-neuron-2023-06-26T09-27-02.137Z-692efe1a-8d5c-4033-bcbc-5d99f2d4ae6a. I can try the previous one.

DhruvaBansal00 commented 1 year ago

Trying huggingface-neuron-2023-04-20T11-02-28.279Z-692efe1a-8d5c-4033-bcbc-5d99f2d4ae6a
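
For anyone who lands here with several of these AMIs to choose from: the two names in this thread appear to embed a build timestamp, so picking "the previous one" could be sketched roughly as below. This is only a sketch. The name layout `huggingface-neuron-<timestamp>Z-<id>` is an assumption inferred from the two names above, and `ami_build_time` and `previous_ami` are hypothetical helpers, not part of any AWS or Hugging Face tooling.

```python
from datetime import datetime, timezone

# Assumed prefix, inferred from the two AMI names quoted in this thread.
PREFIX = "huggingface-neuron-"


def ami_build_time(name: str) -> datetime:
    """Parse the timestamp embedded in an AMI name (hypothetical helper)."""
    if not name.startswith(PREFIX):
        raise ValueError(f"unexpected AMI name: {name!r}")
    # Keep only the part between the prefix and the trailing "Z-<id>".
    stamp = name[len(PREFIX):].split("Z-", 1)[0]
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H-%M-%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def previous_ami(names, current):
    """Return the newest AMI name built before `current`, or None."""
    cutoff = ami_build_time(current)
    older = [n for n in names if ami_build_time(n) < cutoff]
    return max(older, key=ami_build_time) if older else None


if __name__ == "__main__":
    new = "huggingface-neuron-2023-06-26T09-27-02.137Z-692efe1a-8d5c-4033-bcbc-5d99f2d4ae6a"
    old = "huggingface-neuron-2023-04-20T11-02-28.279Z-692efe1a-8d5c-4033-bcbc-5d99f2d4ae6a"
    print(previous_ami([new, old], new))
```

The list of candidate names has to come from wherever you look up AMIs; the sketch only orders names you already have.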

DhruvaBansal00 commented 1 year ago

OK, that AMI works. Thanks for your quick response!

I had to undo my PR to make it work on the previous AMI: https://github.com/philschmid/aws-neuron-samples/pull/2

I am trying to train a T5 model. Do you know if this AMI can be used to train a T5 model?

philschmid commented 1 year ago

Thank you! We are working on fixing that ASAP!