Closed — samir-souza closed this issue 10 months ago
Is the output you posted the current behavior? Because it seems fine to me.
Yes, I just ran a training job for bert-base-uncased (binary text classification) with 2 cores, and that was the result of save_model for me. I tried both 0.0.8 and 0.0.9, but the result is the same. I also tried 32 cores and got 32 checkpoints!! Here is the training code: https://github.com/aws-samples/ml-specialized-hardware/blob/main/purpose-built-accelerators/notebooks/02_ModelFineTuning.ipynb
@michaelbenayoun I rebuilt my training script and realized it was my mistake. I managed to get the correct results, so I'm closing this ticket. Thanks.
Trainer.save_model invokes save_model from its superclass, which doesn't handle multi-core distributed training correctly on Trainium. The side effect is that it dumps one copy (checkpoint) of the trained model per core. This creates a huge amount of data, especially for big models trained with 32 cores. It would be better to invoke xm.save, which handles this complexity and saves only one copy of the model.
For instance, this is the output of a 2-core training job: