mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes
https://mlcommons.org/en/groups/training
Apache License 2.0

Work Estimate / Blockers for References (gradient accumulation and convergence understanding) #384

Open bitfort opened 4 years ago

bitfort commented 4 years ago

Reference owners, please update the yellow cells with a work estimate (in weeks/days) or the blocking issue that needs to be resolved: https://docs.google.com/spreadsheets/d/1W8L8SBIrgbJ_f_-2hUt8SqLNkzAvsKNkQ0A6pKWz9_8/edit#gid=0

bitfort commented 4 years ago

SWG:

Will request that reference owners add status updates to the spreadsheet next week.

bitfort commented 3 years ago

We had requested a status update in this spreadsheet: https://docs.google.com/spreadsheets/d/1W8L8SBIrgbJ_f_-2hUt8SqLNkzAvsKNkQ0A6pKWz9_8/edit#gid=0

We will touch base next week:

  1. Does it need gradient accumulation?
  2. Status on adding gradient accumulation?
  3. Convergence Curve - https://drive.google.com/drive/u/0/folders/1sDmlkLyehFcQWEEW8IhQUbLafaPhTE-9

Convergence Curves: Run 2x the number of runs required for submission, spread across batch sizes ranging from the historical minimum submitted batch size to the historical maximum submitted batch size, stepping in powers of 2 from min to max.
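A rough sketch of how that sweep could be laid out, in case it helps reference owners plan runs. This is only one reading of the rule above: it assumes "powers of 2" means doubling the batch size starting at the min and stopping at the last value that does not exceed the max, and that the 2x runs are split as evenly as possible across the resulting batch sizes. The function names and the example numbers are hypothetical and do not come from any benchmark.

```python
def batch_size_sweep(min_bs, max_bs):
    """Batch sizes from min_bs up to max_bs, doubling at each step.

    Assumption: the sweep stops at the last doubled value <= max_bs.
    """
    if min_bs <= 0 or max_bs < min_bs:
        raise ValueError("need 0 < min_bs <= max_bs")
    sizes = []
    bs = min_bs
    while bs <= max_bs:
        sizes.append(bs)
        bs *= 2
    return sizes


def spread_runs(required_runs, sizes):
    """Split 2x the required submission runs across the batch sizes.

    Assumption: runs are divided as evenly as possible, with any
    remainder going to the smallest batch sizes first.
    """
    total_runs = 2 * required_runs
    base, extra = divmod(total_runs, len(sizes))
    return {bs: base + (1 if i < extra else 0) for i, bs in enumerate(sizes)}


# Hypothetical example: min batch size 256, max 2048, 5 required runs.
sizes = batch_size_sweep(256, 2048)
plan = spread_runs(5, sizes)
for bs, runs in plan.items():
    print(f"batch size {bs}: {runs} runs")
```

If the historical max is not an exact doubling of the min, this sketch leaves it out of the sweep; adding it as a final point would be a reasonable alternative.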

johntran-nv commented 3 years ago

In addition to gradient accumulation and convergence curves, we also need to update logging to the latest v0.7 (or v1.0?) spec. I've added a column in the status spreadsheet for this.

johntran-nv commented 3 years ago

From this week's meeting: