[HPC] Proposal: Offer long, guaranteed benchmark stability

nvaprodromou commented 1 year ago

Introduction:

After collecting feedback from engineers, clients, and press, NVIDIA presented a list of proposals that aim to improve the popularity of the MLPerf HPC benchmark suite. Please see our slide deck for more information on our feedback gathering process and insights.

Proposal: Offer long, guaranteed benchmark stability (guarantees submission longevity)

Slide 13 in proposals slide deck.

We propose to offer guaranteed benchmark stability for some agreed-upon duration:

First appearance of benchmark labeled "beta": A beta benchmark may or may not change by the next submission round.
Second submission round drops the beta status and freezes the benchmark for X years. We guarantee that code and dataset will not be modified at all during the guaranteed lifespan.
Since benchmark is guaranteed stable, this proposal allows carrying results from prior rounds.

This proposal aims to improve the popularity of the MLPerf HPC benchmark suite by improving on the following aspects:

High submission overhead and cost [Affects participation and competition]
Enables prioritizing of MLPerf-HPC for new systems [Affects participation and competition]

Discussion

Pros:

Carrying prior results reduces overhead for participation
Guaranteed benchmark lifespan reduces effective submission overhead/cost since one submission will be valid for the next X years
Improves prioritizing MLPerf HPC for new systems, if it is designed to be more stable than MLPerf Training

Cons:

Bugs in model code and dataset will not be corrected if they are identified after beta status drops

sparticlesteve commented 1 year ago

I support making benchmark stability a goal. However, I don't think we've actually had any problems yet where someone couldn't reuse results that wanted to, have we? If not, or even if very rarely, is it necessary to formalize this? Without formalizing it, we give ourselves the flexibility to maximize stability when possible but to decide as a group on the rare cases whether to allow for improvements or bugfixes.

memani1 commented 1 year ago

Yes, I think for now this can be decided based on group consensus.

nvaprodromou commented 1 year ago

Since we haven't had an issue with it, why not formalize it? Sounds like we just need to add a paragraph in the policies document.

I think a formal guarantee that a benchmark will not change will be a strong motivator for submitters who are concerned about their "return on investment" (investment would be the total engineer/machine hours needed for submissions, return will be longevity of results). We don't have to guarantee stability for centuries. Perhaps 2-3 years? We have to retire benchmarks after a few years anyway (not a focus yet, but it will unavoidably become one over time).

The flexibility and ensuring stability argument is fair. There may be ways to remain flexible under this proposal: Off the top of my head we can, when absolutely necessary, create a fork of some benchmark that in turn has a guaranteed lifespan. So if for example we find a bug in cosmoflow during its guaranteed lifetime (ideally we won't - that's why the first year's beta status is there for), we can create a "cosmoflow 2.0" benchmark while keeping cosmoflow 1.0 still intact. Perhaps in this case, we no longer allow 1.0 submissions moving forward?

mlcommons / training_policies