mlcommons / algorithmic-efficiency

MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.
https://mlcommons.org/en/groups/research-algorithms/
Apache License 2.0

Allow DDP checkpointing #755

Closed: pomonam closed this pull request 3 months ago

pomonam commented 4 months ago

This PR also suppresses the NCCL error message that appears when the user answers "N" at the resume-training prompt.
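For readers unfamiliar with why declining to resume can trigger NCCL noise under DDP, here is a minimal sketch of one common way to avoid it. It is not taken from this PR's diff: the helper names (`should_resume`, `broadcast_decision`, `exit_cleanly`, `handle_resume_prompt`) and the exit code are assumptions made for illustration. The idea is that rank 0 reads the answer, every rank learns the decision, and each rank tears down its process group before exiting instead of leaving its peers waiting on a collective.

```python
import sys

try:
    import torch.distributed as dist
except ImportError:  # assumption: lets the sketch run single-process without PyTorch
    dist = None


def _ddp_active() -> bool:
    """True only when a torch.distributed process group is initialized."""
    return dist is not None and dist.is_available() and dist.is_initialized()


def should_resume(answer: str) -> bool:
    """Interpret the reply to a hypothetical '[y/N]' resume prompt."""
    return answer.strip().lower() in ("y", "yes")


def broadcast_decision(resume: bool) -> bool:
    """Share rank 0's decision so that all ranks take the same branch."""
    if not _ddp_active():
        return resume
    box = [resume if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(box, src=0)
    return bool(box[0])


def exit_cleanly(code: int = 0) -> None:
    """Destroy the process group (if any) before exiting the process."""
    if _ddp_active():
        dist.destroy_process_group()
    sys.exit(code)


def handle_resume_prompt(answer: str) -> bool:
    """Return True to resume; otherwise exit every rank without hanging peers.

    Only rank 0's `answer` matters; other ranks receive it via broadcast.
    """
    resume = broadcast_decision(should_resume(answer))
    if not resume:
        exit_cleanly(0)
    return True
```

In a single process (no process group initialized) the helpers fall back to plain behavior, so `handle_resume_prompt("N")` simply raises `SystemExit`. Whether the PR uses a broadcast, a barrier, or another mechanism is not stated in the description above.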

github-actions[bot] commented 4 months ago

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅