mlcommons / algorithmic-efficiency

MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.
https://mlcommons.org/en/groups/research-algorithms/
Apache License 2.0

Inform submission about `accumulated_submission_time` #785

Open Niccolo-Ajroldi opened 1 month ago

Niccolo-Ajroldi commented 1 month ago

Description

Currently, update_params has no up-to-date information about the elapsed time since start.

My motivation for adding this feature is to simplify the implementation of a time-based learning rate schedule.

Can't a submission just keep track of time or estimate it? In theory yes: this is allowed by the rules and is feasible. However, such an implementation would require synchronization among devices inside update_params when training in distributed mode, and that overhead would penalize the submission.

Why is a time-based scheduler useful? Currently, a submission can implement a LR scheduler using step_hint as a step budget. This is a reliable estimate of the number of steps needed for (N)AdamW to reach max_runtime. However, a submission could be faster/slower than (N)AdamW, and the extent of this difference can vary based on the workload itself. This makes deriving a custom step budget from step_hint suboptimal.
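To make the idea concrete, here is a minimal sketch of a schedule indexed by elapsed time rather than by step count. The function name, the warmup fraction, and the choice of cosine decay are illustrative assumptions, not part of the benchmark API; the only inputs taken from the discussion are the elapsed time and max_runtime.

```python
import math


def time_based_cosine_lr(elapsed_sec, max_runtime_sec, base_lr, warmup_frac=0.05):
    """Linear warmup followed by cosine decay, driven by elapsed time.

    Hypothetical helper: `elapsed_sec` stands in for
    accumulated_submission_time and `max_runtime_sec` for max_runtime.
    """
    # Fraction of the time budget already used, clamped to [0, 1].
    progress = min(max(elapsed_sec / max_runtime_sec, 0.0), 1.0)
    if progress < warmup_frac:
        # Linear warmup from 0 to base_lr.
        return base_lr * progress / warmup_frac
    # Cosine decay from base_lr to 0 over the remaining budget.
    decay_progress = (progress - warmup_frac) / (1.0 - warmup_frac)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * decay_progress))
```

Because the schedule depends only on the fraction of the time budget consumed, it would reach zero at max_runtime regardless of whether the submission takes more or fewer steps than step_hint suggests.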

Implementation

We could simply pass train_state to update_params, or even just train_state['accumulated_submission_time']. Note that update_params already receives eval_results as input, which tells the submission the elapsed time at the moment of the last evaluation. That value is not up to date with accumulated_submission_time, since it lags by however much time has passed since that evaluation.
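The gap between the two sources of timing information can be illustrated with a toy loop. Everything here is a simplified stand-in: the update_params signature, the layout of eval_results, and the simulated clock are assumptions made for the sketch and do not mirror the benchmark's actual code.

```python
def update_params(global_step, eval_results, accumulated_submission_time):
    """Toy stand-in for a submission's update function (hypothetical signature).

    Returns how stale the timing information in eval_results is compared
    with the value passed in directly.
    """
    # Time recorded at the most recent evaluation, or 0.0 if none has run yet.
    if eval_results:
        time_at_last_eval = eval_results[-1]["accumulated_submission_time"]
    else:
        time_at_last_eval = 0.0
    return accumulated_submission_time - time_at_last_eval


def run(num_steps=10, eval_every=4, sec_per_step=2.0):
    """Simulated training loop with a fake clock (no real timing involved)."""
    train_state = {"accumulated_submission_time": 0.0}
    eval_results = []
    staleness_log = []
    for step in range(num_steps):
        # The runner passes only the scalar, not the whole train_state.
        staleness_log.append(
            update_params(step, eval_results,
                          train_state["accumulated_submission_time"]))
        train_state["accumulated_submission_time"] += sec_per_step
        if (step + 1) % eval_every == 0:
            # An evaluation snapshots the elapsed time at that moment.
            eval_results.append({
                "accumulated_submission_time":
                    train_state["accumulated_submission_time"]
            })
    return staleness_log
```

In this simulation the staleness grows linearly between evaluations and resets after each one, which is the lag a time-based schedule would suffer if it relied on eval_results alone.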

Niccolo-Ajroldi commented 1 week ago

Remark: we should prevent a submission from modifying train_state.

A straightforward solution would be to pass it by copy. A shallow copy should be enough, since all values in train_state are primitive types and cannot be modified in place. Although very fast, this copy would still be counted as part of the submission time.
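A small sketch of why a shallow copy suffices when every value is an immutable primitive. The keys and values below are illustrative placeholders, and the misbehaving function is hypothetical; the point is only that reassigning keys on the copy leaves the runner's dictionary untouched.

```python
import copy


def misbehaving_update_params(train_state):
    """Hypothetical submission code that tries to tamper with the state."""
    train_state["accumulated_submission_time"] = 0.0
    train_state["training_complete"] = True


def call_with_shallow_copy(train_state):
    """Hand the submission a shallow copy instead of the original dict."""
    state_copy = copy.copy(train_state)
    misbehaving_update_params(state_copy)
    return state_copy


# Illustrative runner-side state; all values are immutable primitives.
runner_state = {"accumulated_submission_time": 123.4, "training_complete": False}
tampered = call_with_shallow_copy(runner_state)
```

After the call, runner_state still holds its original values while only the copy has changed. This guarantee would no longer hold if train_state ever contained a mutable value such as a list or nested dict, in which case a deep copy (or passing only the scalar) would be needed.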

Another solution would be to just pass train_state['accumulated_submission_time'] instead.