mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes
https://mlcommons.org/en/groups/training

[HPC] Proposal: Allow throughput extrapolation to large system size #508

Open nvaprodromou opened 1 year ago

nvaprodromou commented 1 year ago

Proposal depends on #507.

Introduction:

After collecting feedback from engineers, clients, and the press, NVIDIA presented a list of proposals that aim to improve the popularity of the MLPerf HPC benchmark suite. Please see our slide deck for more information on our feedback-gathering process and insights.

Proposal: Allow throughput extrapolation to large system size

Slide 15 in the proposals slide deck.

Since the file system (FS) is no longer part of the score (per proposal #507), there is no reason to keep running the "Throughput" benchmark (formerly "weak scaling"; see proposal #511) the way it is run today. Extrapolating the score becomes sufficient.

Under this proposal, submitters submit only TTS results (formerly "strong scaling"; see proposal #511). They can make multiple TTS submissions at different scales.
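For concreteness, here is a minimal sketch of what such a naive extrapolation could look like. The function name and the linear-replication assumption are ours, purely for illustration, and are not part of any rule text:

```python
# Naive throughput extrapolation: assume the measured instance can be replicated
# across the whole system with no contention from shared IO or network --
# exactly the effects this proposal leaves out of the measurement.

def extrapolated_throughput(tts_seconds: float,
                            nodes_per_instance: int,
                            total_system_nodes: int) -> float:
    """Models trained per hour if the measured instance were simply replicated."""
    concurrent_instances = total_system_nodes // nodes_per_instance
    return concurrent_instances * 3600.0 / tts_seconds

# Example: a 64-node instance with a 30-minute TTS, extrapolated to 4096 nodes:
# 64 concurrent instances x 2 runs/hour = 128 models/hour.
print(extrapolated_throughput(1800.0, 64, 4096))  # -> 128.0
```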

This proposal aims to improve the popularity of the MLPerf HPC benchmark suite by:

  1. Reducing the high submission overhead and cost [Affects participation and competition]
  2. Making it easier to prioritize MLPerf-HPC on new systems [Affects participation and competition]
  3. Enabling more competitors to provide MLPerf-HPC benchmark results to their potential clients [Improves RFP interest]
  4. Isolating the benchmarking of compute from the FS [Improves RFP interest]
  5. Simplifying results parsing and understanding [Improves press interest]

Note: We previously supported the current rule because it is more technically robust, but we now think that position was a mistake: we can only ask so much of HPC submitters, and participation is an issue.

Discussion

Pros:

  1. Throughput runs (formerly "weak scaling"; see proposal #511) are incredibly expensive and require large portions of the system to be reserved for a long time.
  2. The proposal addresses almost every item of the feedback we received.

Cons:

  1. Submitters can no longer prove full-system availability, as the current rules allow.

sparticlesteve commented 1 year ago

My comments

memani1 commented 1 year ago

It is essential to set realistic expectations about the largest scale that can actually be run on a system when reporting extrapolated numbers. Otherwise the result may reflect a purely theoretical metric.

TheKanter commented 1 year ago

Wouldn't this encourage people to just measure on a single node and neglect network entirely? Seems like it encourages 'system scale' cherry picking...

nvaprodromou commented 1 year ago

The comments focus primarily on the technical correctness of the proposed rules. I would like to remind all of us that we set a goal to increase participation, competition, and popularity. Assuming that is the goal we care to optimize for, other aspects of the benchmark have to be compromised. We are not excited about this either, but we strongly believe that prioritizing participation is worth the cost to benchmark quality in the long run.

@TheKanter correct - this excludes network and IO impact from the measurement, which is a significant part of the score with today's rules. Ideally, this proposal should be combined with proposal #507 (remove data movement from score). Assuming #507 is approved, there's no reason to force submitters to actually do the "Throughput" runs because the score is predictable and accurate without those runs.

Obviously, this is not ideal, since it excludes the impact of certain system components and results in a theoretical peak-performance measurement (similar to HPL). On the other hand, it significantly reduces the investment a potential submitter must make in order to submit at all, which was by far the loudest piece of feedback we received. Given that the primary goal of these proposals is to increase participation, competition, and popularity, proposals #507 and #508 can make a huge difference.

FYI, I learned today that one of our partners, when asked whether they would submit to MLPerf-HPC v3.0, said they cannot and cited budget constraints as the only reason. Proposals #507 and #508 together reduce the cost of submission to its minimum.

Some thoughts about the points raised by @sparticlesteve earlier:

"One reason we added the throughput measurement was to make the benchmark bigger, i.e. to make it something you can run on the largest HPC systems that exist." True, that was the intention. In practice, however, we ended up modeling a niche form of full-system execution, and even then we model the worst-case scenario: our "Throughput" benchmark essentially models a hyperparameter-tuning workload, which is arguably very useful, but forcing all instances to start at the same time is not representative of any real-world scenario or any real-world scheduler. We artificially impose the worst case on IO and networking; we would very rarely see this behavior in real workloads.

"The throughput measurement is already optional. Submitters don't need to do it. For RFPs, people can still use the naive extrapolation method even if we measure it explicitly." True. However, if a score combines storage and compute (TTT or Throughput), it is of limited use to many entities seeking to purchase an HPC system. That was a surprising bit of feedback, but feedback nonetheless. To be fair, this applies more to the "Throughput" benchmark than to TTT.

"I wonder if we could get data to help indicate how wrong the naive extrapolation is. If it's very wrong, I don't see much value in reporting such a metric. The score will be quite off." I believe we saw cases where data staging accounted for 20% of the score (don't quote me on that), and we could potentially get a more accurate estimate from the engineers who measured it. However, I wouldn't call the extrapolation wrong. It is a theoretical peak-performance score, much like an HPL score, but it still has value.
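As a rough illustration of the size of that error (treating the 20% figure above as purely hypothetical): if data staging accounts for 20% of a measured run, a compute-only extrapolation is based on the remaining 80% of the time, so it would report roughly 1/0.8 = 1.25x the measured throughput, i.e. about 25% optimistic at the measured scale, before any additional contention at larger scales.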

"This will make it more challenging to decide what counts as a system, and it potentially allows submissions on highly impractical systems like massive non-HPC cloud resources." True. We might need to apply geographical (or connectivity?) restrictions to how we define a system, or come up with some other definition.

"Would we prefer having a single model training that can scale to the largest systems today? Should we redirect our energy into finding such an alternate application rather than this metric+measurement?" We would very much prefer that. In practice, however, there are very few models that can scale to arbitrary sizes, and even those often have a scale limit. Furthermore, such models (at least the ones I can think of off the top of my head) are already included in MLPerf-T. Finally, developing such a benchmark would require a tremendous amount of resources, which means only a handful of MLCommons participants could even consider doing it.