mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes
https://mlcommons.org/en/groups/training
Apache License 2.0

Add v1.0 specific rules #431

Closed johntran-nv closed 3 years ago

johntran-nv commented 3 years ago

Adding the rules that we've agreed on in the last few SWG meetings.

github-actions[bot] commented 3 years ago

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

nvcforster commented 3 years ago

Suggested BERT v1.0 clip-norm rule edit: For v1.0 only, BERT submissions may implement clip-norm either before or after the inter-accelerator all-reduce. For future rounds, the expectation is that submissions must use clip-norm-after-reduce, to be consistent with the most commonly used public BERT model repos. For performance consistency of at-scale BERT submissions in v1.0, submitters are disallowed from using clip-norm-after-reduce to enable additional overlap of communication and math. A submitter who plans to use clip-norm-after-reduce for v1.0 must notify the committee before the submission deadline, and must be prepared to show code in their submission proving that clip-norm-after-reduce does not give them any additional overlap.
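To make the two orderings in the suggested rule concrete, here is a minimal single-process sketch. It is not taken from any BERT reference implementation: the two-worker gradient values, the averaging all-reduce, and every function name are assumptions chosen for illustration. It only shows that clipping each worker's gradient before the reduction and clipping the reduced gradient afterwards generally produce different updates.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(x * x for x in vec))


def clip_by_norm(vec, max_norm):
    """Scale vec down so its L2 norm is at most max_norm."""
    norm = l2_norm(vec)
    if norm <= max_norm:
        return list(vec)
    scale = max_norm / norm
    return [x * scale for x in vec]


def all_reduce_mean(worker_grads):
    """Stand-in for an inter-accelerator all-reduce (assumed here to average)."""
    n = len(worker_grads)
    return [sum(column) / n for column in zip(*worker_grads)]


def clip_before_reduce(worker_grads, max_norm):
    # Each worker clips its own local gradient, then the results are reduced.
    return all_reduce_mean([clip_by_norm(g, max_norm) for g in worker_grads])


def clip_after_reduce(worker_grads, max_norm):
    # Gradients are reduced first, then the global gradient is clipped once.
    return clip_by_norm(all_reduce_mean(worker_grads), max_norm)


# Hypothetical local gradients on two workers (norms 5.0 and 0.5).
worker_grads = [[3.0, 4.0], [0.3, 0.4]]
before = clip_before_reduce(worker_grads, max_norm=1.0)
after = clip_after_reduce(worker_grads, max_norm=1.0)
print("clip-before-reduce norm:", round(l2_norm(before), 6))
print("clip-after-reduce norm: ", round(l2_norm(after), 6))
```

In this toy case the clip-before-reduce result has a smaller norm than the clip-after-reduce result, because the large local gradient is scaled down before it is averaged with the small one.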

johntran-nv commented 3 years ago

@emizan76, any other comments on this one? I fixed up some conflicts with master, so this should be ready to merge now.

emizan76 commented 3 years ago

I wanted to discuss the wording, or the meaning, of the MaskRCNN convergence rule for different backbones. It was actually one of the bullet points I wanted to bring up in the previous meeting. We think that if a submission with a different backbone passes the RCP check, then it should not be treated differently, even if it converges differently (which in this case means more slowly).

Let me know what you think. If it is urgent to have this merged, we can merge it now and discuss it next week.

Elias


johntran-nv commented 3 years ago

@emizan76, like I said, I don't feel strongly, so I went ahead and changed it to "faster" as you suggested. With that, are you OK with merging?