mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes
https://mlcommons.org/en/groups/training
Apache License 2.0

MLPerf HPC Weak scaling metric proposal #458

Closed azrael417 closed 3 years ago

azrael417 commented 3 years ago

Dear Sir/Madam,

This PR contains the rule changes for MLPerf HPC concerning the updated performance metrics. Please consider this a draft/WIP, as some details are still being discussed. As such, I encourage members of the MLPerf HPC WG to comment on the PR and refine it before it is ultimately merged.

Best regards Thorsten

github-actions[bot] commented 3 years ago

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

sparticlesteve commented 3 years ago

Copying some of my comments here from slack, for posterity:

Using max(end_time) - min(start_time) means the result is entirely driven by the slowest model, and is thus highly sensitive to outliers (i.e. bad luck) in training variability. This is of course why we use olympic scoring in the traditional metric reporting. A possible way to mitigate this is to run K models, but drop the last one to finish and compute the time using just the remaining K-1 runs.

David had a further suggestion, which I think was along the lines of using the average runtime across the K concurrent runs. This is more resilient to variability. However, I think it would require us to enforce that all jobs actually run at the same time.
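For illustration, here is a minimal sketch of the scoring variants discussed above, assuming each of the K concurrent runs reports a (start_time, end_time) pair in seconds; the function names and timings are hypothetical and not part of the proposal:

```python
# Illustrative only: compares the weak-scaling scoring variants discussed above.
# Assumes each of the K concurrent runs reports (start_time, end_time) in seconds.

def score_full_span(runs):
    """max(end) - min(start) over all K runs: driven by the slowest run."""
    starts, ends = zip(*runs)
    return max(ends) - min(starts)

def score_drop_slowest(runs):
    """Drop the last run to finish, then score the remaining K-1 runs."""
    kept = sorted(runs, key=lambda r: r[1])[:-1]
    starts, ends = zip(*kept)
    return max(ends) - min(starts)

def score_mean_runtime(runs):
    """Average per-run training time across the K runs (the averaging suggestion)."""
    return sum(end - start for start, end in runs) / len(runs)

if __name__ == "__main__":
    # Hypothetical timings: four concurrent runs, one slow outlier.
    runs = [(0.0, 1000.0), (5.0, 980.0), (2.0, 1010.0), (1.0, 1400.0)]
    print(score_full_span(runs))     # 1400.0 -- dominated by the outlier
    print(score_drop_slowest(runs))  # 1010.0
    print(score_mean_runtime(runs))  # 1095.5
```

The toy numbers show the trade-off: the full-span score is set entirely by the outlier, dropping the slowest finisher removes that sensitivity, and the mean is more resilient but only meaningful if all jobs actually run concurrently.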

sparticlesteve commented 3 years ago

More discussion points copied from slack, for posterity:

azrael417 commented 3 years ago

How exactly do we define a “system”? What level of heterogeneity (partitions with different hardware) do we allow? Do we allow system partitions to be spread out geographically (probably not useful)?

Actually I think this is irrelevant for this metric, since you have to run at the scale you report. So if you can run on all your cloud instances together that should be fine. It is unlikely that someone is going to do that anyway. That means we probably do not need to define what a system is.

azrael417 commented 3 years ago

Should we impose a minimum (or fixed) batch size? This would give us some control over runtime and disincentivize folks from running the smallest possible batch size for the longest possible system run time (which disadvantages folks who cannot get that amount of system access).

No, imo that should be left open. You can also reach a large batch size on a single accelerator by using gradient accumulation, for example, as sketched below.
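A minimal PyTorch-style sketch of the gradient accumulation point, assuming a toy model and a hypothetical accumulation factor; none of the names or values come from the rules:

```python
import torch

# Illustrative only: reaching a large effective batch size on a single
# accelerator by accumulating gradients over several micro-batches.
ACCUM_STEPS = 8  # hypothetical: effective batch = micro_batch_size * ACCUM_STEPS

model = torch.nn.Linear(32, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

optimizer.zero_grad()
for step in range(ACCUM_STEPS):
    x = torch.randn(16, 32)                    # micro-batch of 16 samples
    y = torch.randn(16, 1)
    loss = loss_fn(model(x), y) / ACCUM_STEPS  # scale so gradients average
    loss.backward()                            # gradients accumulate in .grad
optimizer.step()                               # one update for the whole effective batch
optimizer.zero_grad()
```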

sparticlesteve commented 3 years ago

@johntran-nv could you take a look and merge?

sparticlesteve commented 3 years ago

@azrael417 can you remove WIP from the title? I think it is ready to be merged.