huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Trainer.save_model dumps one checkpoint per neuron core #187

Closed samir-souza closed 10 months ago

samir-souza commented 11 months ago

Trainer.save_model invokes save_model from the superclass, which doesn't handle multi-core distributed training correctly on Trainium. The side effect is that it dumps one copy (checkpoint) of the trained model per core. This creates a huge amount of data, especially for big models trained with 32 cores. It would be better to invoke xm.save, which takes care of this complexity and saves only one copy of the model.

    def save_model(self, output_dir: Optional[str] = None, _internal_call: bool = False):
        if output_dir is None:
            output_dir = self.args.output_dir
        if self.accelerator.distributed_type is NeuronDistributedType.XLA_FSDP:
            self.accelerator.state.fsdp_plugin.save_model(self.accelerator, self.model, output_dir, 0)
        elif self.accelerator.distributed_type is NeuronDistributedType.TENSOR_PARALLELISM:
            parallelizer = ParallelizersManager.parallelizer_for_model(self.model)
            parallelizer.save_model_checkpoint(self.model, output_dir, as_regular=False)
        else:
            # this fallback to the superclass is what dumps one checkpoint per core
            return super().save_model(output_dir=output_dir, _internal_call=_internal_call)

For instance, this is the output of a 2-core training job:

1.3G    model/checkpoint-1022
1.3G    model/checkpoint-9198

model/checkpoint-9198
model/checkpoint-9198/rng_state_1.pth
model/checkpoint-9198/rng_state_0.pth
model/checkpoint-9198/config.json
model/checkpoint-9198/training_args.bin
model/checkpoint-9198/optimizer.pt
model/checkpoint-9198/trainer_state.json
model/checkpoint-9198/pytorch_model.bin
model/checkpoint-9198/scheduler.pt
model/checkpoint-1022
model/checkpoint-1022/rng_state_1.pth
model/checkpoint-1022/rng_state_0.pth
model/checkpoint-1022/config.json
model/checkpoint-1022/training_args.bin
model/checkpoint-1022/optimizer.pt
model/checkpoint-1022/trainer_state.json
model/checkpoint-1022/pytorch_model.bin
model/checkpoint-1022/scheduler.pt
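
For reference, a minimal sketch of the xm.save idea (assuming torch_xla is installed; the save_model_once helper is my own illustration, not optimum-neuron code). xm.save moves the tensors to CPU and, with the default master_only=True, writes from the master ordinal only, so a single checkpoint is produced regardless of the number of cores:

    import os
    import torch_xla.core.xla_model as xm

    def save_model_once(model, output_dir):
        # Hypothetical helper, not part of optimum-neuron: xm.save gathers the
        # state dict to CPU and, with master_only=True (the default), writes a
        # single file from the master process instead of one file per core.
        os.makedirs(output_dir, exist_ok=True)
        xm.save(model.state_dict(), os.path.join(output_dir, "pytorch_model.bin"))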
michaelbenayoun commented 11 months ago

Is the output you give the way it currently works? Because that seems fine to me.

samir-souza commented 11 months ago

Yes, I just ran a training job for bert-base-uncased (binary text classification) with 2 cores, and that was the result of save_model for me. I tried both 0.0.8 and 0.0.9, but the result is the same. I also tried 32 cores and got 32 checkpoints!! Here you can see the training code: https://github.com/aws-samples/ml-specialized-hardware/blob/main/purpose-built-accelerators/notebooks/02_ModelFineTuning.ipynb

samir-souza commented 10 months ago

@michaelbenayoun I rebuilt my training script and realized it was my mistake. I managed to get the correct results. Closing this ticket, then. Thanks.