Remark: we should prevent a submission from modifying `train_state`.
A straightforward solution would be to pass it by copy. A shallow copy should be enough, since all values in `train_state` are primitive types and cannot be modified in place. Although very fast, this copying operation would nevertheless be counted as part of the submission time.
Another solution would be to pass just `train_state['accumulated_submission_time']` instead. Both options are sketched below.
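A minimal sketch of the two options at the call site; the dict contents and variable names here are illustrative stand-ins, not the actual AlgoPerf API:

```python
import copy

train_state = {
    'accumulated_submission_time': 123.4,  # seconds spent in the submission so far
    'global_step': 1000,
}

# Option 1: pass a shallow copy. Reassigning keys on the copy does not
# affect the benchmark's dict, and since the values are immutable
# primitives they cannot be changed in place either.
train_state_view = copy.copy(train_state)

# Option 2: pass only the scalar the submission actually needs.
elapsed = train_state['accumulated_submission_time']
```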
Description
Currently, `update_params` has no up-to-date information about the elapsed time since the start of training. My motivation for adding this feature is to simplify the implementation of a time-based learning rate schedule.
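For illustration, a time-based schedule could be as simple as the following sketch; the function name and the cosine shape are assumptions for the example, not part of the proposal:

```python
import math

def time_based_cosine_lr(accumulated_submission_time: float,
                         max_runtime: float,
                         base_lr: float = 1e-3,
                         min_lr: float = 0.0) -> float:
    """Cosine decay driven by the fraction of the wall-clock budget used,
    rather than by the fraction of a step budget."""
    frac = min(accumulated_submission_time / max_runtime, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * frac))
```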
Can't a submission just keep track of time, or estimate it? In theory yes: this is allowed by the rules and feasible. However, such an implementation would require synchronization among devices inside `update_params` when training in distributed mode, which would penalize that submission. The sketch below illustrates why.
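Because device execution is asynchronous, an accurate wall-clock reading must first wait for outstanding device work. A hypothetical helper, using JAX purely as an example (a PyTorch submission would need e.g. `torch.cuda.synchronize()`):

```python
import time
import jax

def elapsed_seconds(latest_update, start_time: float) -> float:
    # Block until queued device work finishes; otherwise the host clock
    # is read while kernels are still running and time is under-counted.
    jax.block_until_ready(latest_update)
    return time.monotonic() - start_time
```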
Why is a time-based scheduler useful? Currently, a submission can implement an LR scheduler using `step_hint` as a step budget. This is a reliable estimate of the number of steps needed for (N)AdamW to reach `max_runtime`. However, a submission could be faster or slower than (N)AdamW, and the extent of this difference can vary with the workload, which makes deriving a custom step budget from `step_hint` suboptimal.

Implementation
We could simply pass `train_state` to `update_params`, or even just `train_state['accumulated_submission_time']`. Notice that `update_params` already receives `eval_results` as an input, which tells the submission the elapsed time at the moment of the last evaluation, but that value is not up to date with `accumulated_submission_time`. A sketch of the scalar-only variant follows.
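A minimal, self-contained sketch of the scalar-only variant; the loop, names, and signatures are simplified stand-ins for the actual runner and submission code:

```python
def update_params(accumulated_submission_time: float) -> None:
    # A submission could now drive a time-based schedule directly.
    print(f'elapsed submission time: {accumulated_submission_time:.1f}s')

def training_loop(train_state: dict, max_runtime: float) -> None:
    while train_state['accumulated_submission_time'] < max_runtime:
        # Forward only the scalar: the submission reads the elapsed time
        # but cannot mutate the benchmark's train_state dict.
        update_params(train_state['accumulated_submission_time'])
        train_state['accumulated_submission_time'] += 10.0  # stand-in for real timing

training_loop({'accumulated_submission_time': 0.0}, max_runtime=60.0)
```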