mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes
https://mlcommons.org/en/groups/training
Apache License 2.0

RCP normalization to mean #503

Closed · nv-rborkar closed this 2 years ago

nv-rborkar commented 2 years ago

This PR proposes:

github-actions[bot] commented 2 years ago

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

johntran-nv commented 2 years ago

Training WG approved today, for the v2.1 round.

@sparticlesteve and @memani1 , can you discuss in the HPC working group and decide if/when you would want to adopt this?

sparticlesteve commented 2 years ago

Is there a motivation for this written up somewhere? It complicates the score calculation for what must be fairly minor effects. I guess it's because folks in the lead are fighting over the few-percent differences. It feels weird to only penalize results that are faster than the RCP mean. If we're adopting the idea of normalizing results, why not apply it more broadly, i.e., to all results within the tolerance band? That would be easier to explain/justify.

johntran-nv commented 2 years ago

Good questions.

First, we realized that the current tolerance band could be interpreted as incentivizing submitters to cherry-pick. Rather than targeting the RCP mean, you could just keep running until you hit the RCP mean minus the tolerance, and then you would have an advantage over others. This rule aims to discourage that: if you converge faster than the RCP mean, your result simply gets normalized back to the RCP mean, so no one is incentivized to cherry-pick beyond it.

The reason it's one-sided rather than two-sided is that we as a community know of ways to trade off convergence for throughput. One simple case is increasing the batch size: you get higher throughput per device, but convergence degrades, so you have to run for longer. Since slowing down convergence is a known way to increase throughput, we don't want to normalize slower runs. We don't know of ways to improve convergence while also increasing throughput, which is why it makes sense to normalize only the faster side.
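To make the one-sided behavior concrete, here is a minimal sketch of the normalization described above. It is illustrative only: the names (`submission_epochs`, `rcp_mean_epochs`, `rcp_min_epochs`) and the exact scaling are assumptions, not the actual MLPerf RCP checker API.

```python
def rcp_normalization_factor(submission_epochs: float,
                             rcp_mean_epochs: float,
                             rcp_min_epochs: float) -> float:
    """Return a factor to multiply the submitted time by (hypothetical sketch).

    Runs that converge faster than the RCP mean, but still within the
    tolerance band (>= rcp_min_epochs), are scaled up to the RCP mean,
    removing the incentive to cherry-pick fast-converging seeds.
    Runs at or slower than the mean are left untouched (one-sided).
    """
    if submission_epochs < rcp_min_epochs:
        # Faster than the tolerance allows: the run fails the RCP test.
        raise ValueError("submission converges faster than RCP tolerance permits")
    if submission_epochs < rcp_mean_epochs:
        # Faster than the mean but within tolerance: normalize up to the mean.
        return rcp_mean_epochs / submission_epochs
    # At or slower than the mean: no adjustment.
    return 1.0


# Example: converging in 2.8 epochs against an RCP mean of 3.0 (tolerance
# floor 2.7) scales the submitted time by 3.0 / 2.8 ≈ 1.07.
scored_time = 100.0 * rcp_normalization_factor(2.8, 3.0, 2.7)
```

Under this sketch, a run slower than the RCP mean keeps its measured time unchanged, which matches the rationale that slower convergence is already a throughput trade-off submitters can make deliberately.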

TheKanter commented 2 years ago

John - this is an excellent write-up, do we have it down anywhere else? Could make for good documentation.

johntran-nv commented 2 years ago

@TheKanter , I don't know of a good place to explain the rules, other than just pointing people at the rules or having them read the WG minutes. Is there another place for that?

@sparticlesteve , do you think we've had enough soak time for HPC folks, so we can merge this now?

sparticlesteve commented 2 years ago

@johntran-nv I shared it with the group and brought it up in a couple of meetings, so yeah, I think we're good. Thanks for giving us the chance to provide input, and thanks for your explanations.