huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

run_clm.py (+ other examples) output checkpoints in sharded format #349

Open 5cp opened 7 months ago

5cp commented 7 months ago

When training with examples like run_clm.py with tensor parallelism (TP) enabled, the checkpoints are saved in sharded format and must be consolidated before they can be used with downstream tools.

For example, the following code can be added to the end of the training block in run_clm.py to perform consolidation and remove the shards:

    # Only rank 0 consolidates, and only when TP actually produced shards.
    # Assumes `os` is already imported at the top of the script.
    if int(os.environ.get("RANK", -1)) == 0 and int(training_args.tensor_parallel_size) > 1:
        print("Converting sharded checkpoint to consolidated format")
        from optimum.neuron.distributed.checkpointing import consolidate_tensor_parallel_checkpoints_to_unified_checkpoint
        from shutil import rmtree
        consolidate_tensor_parallel_checkpoints_to_unified_checkpoint(
            training_args.output_dir,  # read the shards from here
            training_args.output_dir,  # write the consolidated checkpoint here
            "pytorch",
        )
        # Remove the sharded checkpoint files once consolidation has finished.
        rmtree(os.path.join(training_args.output_dir, "tensor_parallel_shards"))

Should this be the default behaviour? Or should we add a flag to automatically trigger this consolidation?
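
To make the flag option concrete, here is a rough sketch of how an opt-in switch could gate the consolidation. Everything named here is hypothetical: `ConsolidationArguments`, `consolidate_checkpoint` and `maybe_consolidate` do not exist in optimum-neuron, and the consolidation function is passed in as a parameter so the sketch runs without the library installed.

```python
import os
import shutil
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ConsolidationArguments:
    # Hypothetical flag, not an existing optimum-neuron or transformers option.
    consolidate_checkpoint: bool = field(
        default=False,
        metadata={"help": "Consolidate TP shards into a single checkpoint after training."},
    )


def maybe_consolidate(
    output_dir: str,
    tensor_parallel_size: int,
    enabled: bool,
    consolidate_fn: Callable[[str, str, str], None],
    rank: Optional[int] = None,
    shards_dirname: str = "tensor_parallel_shards",
) -> bool:
    """Consolidate sharded checkpoints when the flag is set. Returns True if it ran."""
    if rank is None:
        # Same rank lookup as the snippet above.
        rank = int(os.environ.get("RANK", -1))
    # Skip unless the flag is on, this is rank 0, and TP was actually used.
    if not enabled or rank != 0 or tensor_parallel_size <= 1:
        return False
    print("Converting sharded checkpoint to consolidated format")
    consolidate_fn(output_dir, output_dir, "pytorch")
    # Remove the sharded checkpoint files once consolidation has finished.
    shutil.rmtree(os.path.join(output_dir, shards_dirname), ignore_errors=True)
    return True
```

In run_clm.py this would presumably be called at the end of the training block with `consolidate_tensor_parallel_checkpoints_to_unified_checkpoint` as `consolidate_fn`. Defaulting the flag to `False` keeps the current behaviour unless the user opts in.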

HuggingFaceDocBuilderDev commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!
