mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes
https://mlcommons.org/en/groups/training

Investigate allowing async training for DLRM #273

Open bitfort opened 4 years ago

bitfort commented 4 years ago

This is a longer-term issue to explore. Large models, such as a large recommender like DLRM, can benefit from making parts of the training pipeline asynchronous. These optimizations are commonly used to make large models performant in production; "the math changes on paper but not in practice". Some practitioners consider this an important optimization for certain models. It isn't clear how this fits into the MLPerf rules, specifically the Closed division. It could make sense to allow it in Closed because it is a relatively "vanilla" optimization, but it isn't clear how the rules would be structured to permit it.
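For context (an illustration, not part of the original comment): the "math changes on paper" remark can be stated precisely. Synchronous SGD applies a gradient evaluated at the current parameters, while an asynchronous update applies a gradient evaluated at a parameter state that is some number of steps stale:

```latex
% Synchronous SGD: the gradient is evaluated at the current parameters.
w_{t+1} = w_t - \eta \, \nabla L(w_t)

% Asynchronous (stale) update: the gradient was evaluated \tau steps earlier.
% The update rule differs on paper, even when results match in practice.
w_{t+1} = w_t - \eta \, \nabla L(w_{t-\tau})
```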

bitfort commented 4 years ago

Action item (Tayo): sync with the model owners on this idea.

nvpaulius commented 4 years ago

This is not allowed under the current rules. There was already a ruling on stale gradients, which is what this would be: https://github.com/mlperf/training_policies/issues/36

robieta commented 4 years ago

Indeed. This is specifically a proposal to allow asynchrony for the embedding lookups of DLRM (not allowed by the current rules); all other updates would still be required to be synchronous.
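To make the distinction concrete, here is a minimal sketch (hypothetical; not the DLRM reference code, and the model, ids, and learning rates are stand-ins): the dense parameters receive each step's gradient synchronously, while the embedding table receives a gradient that is one step stale, emulating an asynchronous embedding update.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins: a sparse embedding table and a small dense "tower".
emb = torch.nn.Embedding(1000, 16)
dense = torch.nn.Linear(16, 1)
opt_dense = torch.optim.SGD(dense.parameters(), lr=0.01)
lr_emb = 0.01

pending_emb_grad = None  # gradient held back one step to emulate async staleness

for step in range(10):
    ids = torch.randint(0, 1000, (32,))  # hypothetical sparse feature ids
    labels = torch.rand(32, 1)           # hypothetical targets

    emb.zero_grad()
    opt_dense.zero_grad()
    loss = torch.nn.functional.mse_loss(dense(emb(ids)), labels)
    loss.backward()

    # Dense side stays synchronous: apply this step's gradient immediately.
    opt_dense.step()

    # Embedding side is "async": apply the gradient saved from the previous
    # step, then stash the current gradient for the next iteration.
    current_grad = emb.weight.grad.detach().clone()
    if pending_emb_grad is not None:
        with torch.no_grad():
            emb.weight -= lr_emb * pending_emb_grad
    pending_emb_grad = current_grad
```

In a real system the staleness would come from overlapping the embedding lookup/exchange with dense compute rather than from an explicit one-step buffer; the buffer here just makes the stale-gradient semantics visible.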

tayo commented 4 years ago

This was discussed in the Special Topics meeting on January 30.

There were no objections to allowing this for DLRM. The proposer needs to make the case for this and have it agreed upon before the hyperparameter (HParam) deadline. The case should take the form of evidence that such overlapping is actually used in practice, e.g., papers or testimonials.

bitfort commented 4 years ago

Postponed until next round.