utterworks / fast-bert

Super easy library for BERT based NLP models
Apache License 2.0

Error for training job failed. reason: algorithmerror: exit code: 127 - SageMaker #258

Closed: katreparitosh closed this issue 3 years ago

katreparitosh commented 3 years ago

Hello,

I was training a DistilBERT model on a SageMaker instance using fast-bert, with an ml.p2.xlarge instance for GPU processing.

When fit() downloads the training image from ECR, I receive the error "/usr/bin/env: ‘python\r’: No such file or directory". See below:

image

At the end of the stack trace, I received the following: "error for training job failed. reason: algorithmerror: exit code: 127"

image

Tech stack:

- fast-bert Docker image
- SageMaker notebook instance: ml.t2.medium
- GPU compute: ml.p2.xlarge

What could be the reason for this error? My IAM role has all the required permissions.

Kindly help.

katreparitosh commented 3 years ago

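Sharing the solution to this problem:

It is an EOL (end-of-line) issue. The files had Windows-style \r\n line endings, so the interpreter named on the shebang line was read as "python\r", which does not exist. Use Notepad++ to replace \r\n with \n in every file.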
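For anyone who prefers to script the same replacement instead of using an editor, here is a minimal Python sketch. The function names, the default suffix list, and the rule of skipping hidden directories and files containing NUL bytes are my own assumptions, not part of fast-bert or SageMaker. The empty suffix is there to catch extensionless entry-point scripts.

```python
from pathlib import Path


def crlf_to_lf(data: bytes) -> bytes:
    """Replace Windows CRLF line endings with Unix LF."""
    return data.replace(b"\r\n", b"\n")


def normalize_tree(root, suffixes=(".py", ".sh", "")):
    """Convert CRLF to LF in files under root; return the paths that changed.

    Hypothetical helper: the suffix filter and the skip rules are assumptions.
    """
    root = Path(root)
    changed = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Skip hidden directories such as .git, whose files must not be rewritten.
        if any(part.startswith(".") for part in rel.parts):
            continue
        if not path.is_file() or path.suffix not in suffixes:
            continue
        original = path.read_bytes()
        # A NUL byte suggests a binary file; leave it untouched.
        if b"\0" in original:
            continue
        converted = crlf_to_lf(original)
        if converted != original:
            path.write_bytes(converted)
            changed.append(str(path))
    return changed
```

Calling `normalize_tree("container")`, where `container` is a placeholder for the directory that gets built into the Docker image, rewrites the affected files in place and returns their paths. The image has to be rebuilt and pushed again afterwards for the fix to reach the training job.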

Closing the issue now.