Work Estimate / Blockers for References (gradient accumulation and convergence understanding)

mlcommons / training_policies

Issues related to MLPerf™ training policies, including rules and suggested changes

https://mlcommons.org/en/groups/training

Apache License 2.0


Work Estimate / Blockers for References (gradient accumulation and convergence understanding) #384

Open bitfort opened 4 years ago

bitfort commented 4 years ago

Reference owners, please update the yellow cells with a work estimate (in weeks/days) or the blocking issue that needs to be resolved: https://docs.google.com/spreadsheets/d/1W8L8SBIrgbJ_f_-2hUt8SqLNkzAvsKNkQ0A6pKWz9_8/edit#gid=0

bitfort commented 4 years ago

SWG:

Will request that reference owners add status updates to the spreadsheet next week.

bitfort commented 3 years ago

We had requested a status update in this spreadsheet: https://docs.google.com/spreadsheets/d/1W8L8SBIrgbJ_f_-2hUt8SqLNkzAvsKNkQ0A6pKWz9_8/edit#gid=0

We will touch base next week:

Does it need gradient accumulation?
Status on adding gradient accumulation?
Convergence Curve - https://drive.google.com/drive/u/0/folders/1sDmlkLyehFcQWEEW8IhQUbLafaPhTE-9
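For anyone unsure what the gradient accumulation question is asking, here is a minimal sketch, not taken from any reference implementation. It assumes a toy one-parameter model `y ≈ w * x` with mean squared error; `grad_mse` and `accumulated_grad` are hypothetical helper names. It only illustrates that summing size-weighted micro-batch gradients reproduces the gradient of the full batch, which is what lets a reference emulate a large batch size on smaller hardware.

```python
def grad_mse(w, xs, ys):
    """Gradient w.r.t. w of the mean squared error of y ≈ w * x over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_bs):
    """Accumulate micro-batch gradients so the total equals the full-batch gradient.

    Each micro-batch gradient is a mean over its own samples, so it is
    weighted by (micro-batch size / full batch size) before being summed.
    """
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_bs):
        mx = xs[start:start + micro_bs]
        my = ys[start:start + micro_bs]
        total += grad_mse(w, mx, my) * len(mx) / n
    return total


# Toy data (illustrative only): the optimizer step would be taken once,
# after all micro-batches have been accumulated.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_bs=2)
```

In a real framework the same idea is usually expressed by calling backward on each micro-batch and stepping the optimizer once per accumulated batch; details such as loss scaling and batch-norm statistics differ per reference and are not covered here.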

Convergence Curves: Run 2x the number of runs required for submission, spread across batch sizes from the historical minimum submitted batch size to the historical maximum submitted batch size, stepping in powers of 2 from min to max.
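One possible reading of the convergence-curve plan, as a rough sketch. The function names `batch_size_sweep` and `plan_runs` are hypothetical, and two points are assumptions rather than anything stated in this thread: that the max batch size is always included even when doubling from min does not land on it, and that the 2x runs are split as evenly as possible across the swept batch sizes.

```python
def batch_size_sweep(min_bs, max_bs):
    """Return batch sizes starting at min_bs and doubling up to max_bs.

    Assumption: max_bs is always included as the last point, even if it is
    not min_bs times a power of 2.
    """
    if min_bs <= 0 or max_bs < min_bs:
        raise ValueError("need 0 < min_bs <= max_bs")
    sizes = []
    bs = min_bs
    while bs < max_bs:
        sizes.append(bs)
        bs *= 2
    sizes.append(max_bs)
    return sizes


def plan_runs(min_bs, max_bs, required_runs):
    """Spread 2x the required submission runs across the swept batch sizes.

    Assumption: runs are divided as evenly as possible, with any remainder
    going to the smallest batch sizes first.
    """
    sizes = batch_size_sweep(min_bs, max_bs)
    total = 2 * required_runs
    base, extra = divmod(total, len(sizes))
    return {bs: base + (1 if i < extra else 0) for i, bs in enumerate(sizes)}


# Illustrative numbers only; real min/max batch sizes and run counts
# depend on the benchmark.
plan = plan_runs(min_bs=256, max_bs=4096, required_runs=5)
```

With these illustrative inputs the sweep covers 256, 512, 1024, 2048, and 4096, and the 10 runs are assigned two per batch size.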

johntran-nv commented 3 years ago

In addition to gradient accumulation and convergence curves, we also need to update logging to the latest v0.7 (or v1.0?) spec. I've added a column in the status spreadsheet for this.

johntran-nv commented 3 years ago

From this week's meeting:

Tracking spreadsheet: https://docs.google.com/spreadsheets/d/1W8L8SBIrgbJ_f_-2hUt8SqLNkzAvsKNkQ0A6pKWz9_8/edit?usp=sharing
Reminded group that we plan to freeze on 1/22
Review existing pull requests (https://github.com/mlcommons/training/pulls); all are to be resolved in the next 2 weeks
If you lack permissions to contribute, please update
We should make a label to indicate which PRs are going to impact the references
Follow up over email on status of Minigo
In progress work for every reference.
New action item for all reference owners to fix logging; this is not required by the freeze, but we want it shortly after. [AI JohnT] Email to be sent to owners. (sent on 1/7/21)
Want convergence curves by freeze deadline. Let others know if you need help with this.