mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes
https://mlcommons.org/en/groups/training
Apache License 2.0

[HPC] Need to clarify validation requirements for pruned logs in weak-scaling #493

Closed sparticlesteve closed 1 year ago

sparticlesteve commented 2 years ago

Our rules now describe how to submit pruned logs in the weak-scaling results to establish a proven scale, which can be used for hyperparameter borrowing. However, they do not describe the requirements for those pruned logs to be valid.

We discussed some potential approaches in our meeting on Monday, Jun 13:

sparticlesteve commented 2 years ago

My thinking is that pruned logs should be fully compliant and demonstrate a successfully converged training instance. This is a straightforward requirement that I think adheres to our intent behind result pruning in the weak-scaling submissions. Result pruning helps mitigate the effects of straggler training instances that negatively affect the measured throughput when we measure time-to-train-all. I don't recall us intending to use pruning to help mitigate hardware failures, and our use of the term "proven scale" to me implies that these results should show that the system can actually run successfully at that scale (without crashing).

I fear that allowing invalid log files in the proven scale may enable some undesired behavior. Submitters could intentionally run on "bad nodes" and submit junk log files just to allow them to run at a larger scale after the deadline (e.g. after replacing nodes).

If there is a strong group consensus to use result pruning as a way to help mitigate hardware failures, I would be supportive of relaxing requirements on the pruned log files.

Finally, I believe that without clarifying this rule, the default implication is that logs should always be considered compliant.

coquelin77 commented 2 years ago

I agree with your default interpretation, that the logs submitted should be compliant.

If we were to allow failed runs, I think we should specify a minimum percentage of successful runs.

sparticlesteve commented 2 years ago

Hi @coquelin77. Thanks for your input.

For the non-pruned logs used to compute the throughput, we already require that all logs are successful and that there are at least as many as needed for the time-to-train measurement (i.e. 5 for deepcam, 10 for cosmoflow, 5 for open_catalyst). Is this what you are suggesting, or are you suggesting that we should also have a requirement on the percentage of successful pruned runs?
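For illustration, the minimum-successful-run requirement above could be checked along these lines. This is a hypothetical sketch: the benchmark names and counts come from the comment, but the function name, log representation, and `converged` field are illustrative assumptions, not part of any actual MLPerf tooling.

```python
# Minimum number of successful (converged) runs required for the
# time-to-train measurement, per the comment above.
MIN_SUCCESSFUL_RUNS = {
    "deepcam": 5,
    "cosmoflow": 10,
    "open_catalyst": 5,
}


def has_enough_successful_runs(benchmark, logs):
    """Return True if enough logs report successful convergence.

    `logs` is assumed (hypothetically) to be a list of dicts with a
    boolean "converged" field, e.g. extracted from parsed result logs.
    """
    required = MIN_SUCCESSFUL_RUNS[benchmark]
    successful = sum(1 for log in logs if log.get("converged"))
    return successful >= required


# Example with made-up log records: 4 converged runs out of 5
# submitted is below the deepcam minimum of 5.
logs = [{"converged": True}] * 4 + [{"converged": False}]
print(has_enough_successful_runs("deepcam", logs))  # prints False
```

A percentage-based rule, as suggested above, would replace the absolute threshold with a ratio of successful to total runs.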

sparticlesteve commented 2 years ago

In our meeting last week, July 25, we tentatively decided to adopt the simple solution, which is to interpret our rules as requiring pruned logs to be compliant. Nobody is specifically pushing for relaxed pruned log requirements, and most pruned logs from last year were actually pruned due to slow convergence (large number of epochs).

In today's meeting we revisited the question briefly. We agreed to uphold that decision, but also agreed to watch closely what happens this year.