Open korneevm opened 1 month ago
What instance type are you trying to use?
ml.p4d.24xlarge, as in the tutorial.
I've tried using ml.p4de.24xlarge for training and it worked well. I had to make some minor adjustments to the code. I can open a PR if you are interested.
Yes, could you share what you needed to change?
Hi Phil, thanks for the great repo and examples!
Everything worked well when I played with llama3-70b using your guide, but now I'm stuck when fine-tuning llama3.1-70b.
I followed all the steps from the https://www.philschmid.de/sagemaker-train-deploy-llama3 article, then fixed the problems with incompatible package versions and got the training process to start. But at the "Loading checkpoint shards" step I'm getting an error:
I've tried to work around this problem, but without success. Maybe you could point out what I'm missing?