mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes
https://mlcommons.org/en/groups/training
Apache License 2.0

[MLPerfHPC] Re-submissions due to hardware failures #480

Closed by nvaprodromou 2 years ago

nvaprodromou commented 2 years ago

This PR defines what happens when hardware failures cause a weakly scaled run to miss the submission deadline. This was a point of discussion during the MLPerf HPC v0.7 review meetings.

github-actions[bot] commented 2 years ago

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

sparticlesteve commented 2 years ago

This was reviewed and approved in the May 16 HPC WG meeting.