Closed: nginyc closed this issue 5 years ago.
@nudles I have added some details & reasoning on the major changes I'm going to make for architecture tuning. Let me know if you have any comments & advice on them!
In terms of budget, can we let users configure either hours or trials? Rename `QUICK_TRAIN` to `EARLY_STOP`? When will `DOWNSCALE` be used?
We can keep `MODEL_TRIAL_COUNT` for backward compatibility as well. `DOWNSCALE` is used in architecture search for both NAS and ENAS. During the architecture search phase, the constructed model has fewer layers (e.g. 6 layers) to speed up the search, and its performance is a rough proxy of the actual performance. On the other hand, at the final train phase, the final model, trained from scratch, is full-sized (e.g. 15 layers), and it is the one that will give the best performance.
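The downscaling described above could be sketched as follows. This is a hedged illustration: the function name and the `downscale` knob name are hypothetical, not Rafiki's actual API.

```python
# Hypothetical sketch of how a DOWNSCALE policy could control model depth.
# `pick_num_layers` and the 'downscale' knob name are illustrative only.

def pick_num_layers(knobs, search_layers=6, full_layers=15):
    """Pick this trial's model depth from the realised policy knobs.

    During the architecture search phase, 'downscale' would be realised
    to True and a shallow proxy model is built; at the final
    train-from-scratch phase it is False (or absent) and the
    full-sized model is built.
    """
    return search_layers if knobs.get('downscale') else full_layers
```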
(See "Efficient Neural Architecture Search via Parameter Sharing", the ENAS paper.)
## Planned major changes
To better support architecture tuning with ENAS, I'm planning changes to Rafiki's current model training framework:
### Replacing budget option `MODEL_TRIAL_COUNT` with `TIME_HOURS`
#### Context
Currently, when application developers create model training jobs, they pass a budget like `{ 'GPU_COUNT': 1, 'MODEL_TRIAL_COUNT': 20 }`, with `MODEL_TRIAL_COUNT` deciding the number of trials to conduct for each model template.

#### Change
Replace the `MODEL_TRIAL_COUNT` option with a `TIME_HOURS` option, which specifies how long the train job should run for, e.g. `{ 'GPU_COUNT': 1, 'TIME_HOURS': 12 }`. It is a soft time target. At the same time, I'll be reworking the Advisor component (which proposes trials' knobs) so that it is additionally in charge of deciding how many trials to run, when to stop each worker, and when to stop the train job, given the budget, e.g. `GPU_COUNT` and `TIME_HOURS`.
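As a rough illustration of treating `TIME_HOURS` as a soft target, an advisor-side loop might keep proposing trials until the budget elapses. This is a sketch under assumed names, not Rafiki's actual Advisor code.

```python
import time

def run_train_job(run_next_trial, time_hours, clock=time.monotonic):
    """Keep proposing and running trials until the time budget elapses.

    The budget is a soft target: a trial already in progress when the
    deadline passes is allowed to finish, so the job may overrun slightly.
    """
    deadline = clock() + time_hours * 3600
    num_trials = 0
    while clock() < deadline:
        run_next_trial()  # advisor proposes knobs; a worker runs the trial
        num_trials += 1
    return num_trials
```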
#### Reasons for change
`TIME_HOURS` is more straightforward.

### Introducing `PolicyKnob`

#### Motivation
I have been integrating ENAS as a new model tuning strategy in Rafiki (i.e. in the Advisor component). If model templates want to do architecture tuning with ENAS, the model's training code needs to switch between different "modes":
Similarly, in a standard hyperparameter tuning procedure, you might want the model to do early stopping for the first e.g. 100 trials, then conduct a final trial for a full e.g. 300 epochs.

In both architecture tuning and hyperparameter tuning, the model needs to be configured by Rafiki somehow to switch between these "modes" on a per-trial basis.
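That per-trial switching could be scheduled along these lines. The function and the 100-trial cutoff are illustrative, echoing the example above, not part of Rafiki's API.

```python
def policy_for_trial(trial_no, search_trials=100):
    """Realise a QUICK_TRAIN-style policy for a given trial number.

    Trials 1..search_trials early-stop to explore quickly; any trial
    after that (e.g. the final trial) trains for the full number of epochs.
    """
    return {'quick_train': trial_no <= search_trials}
```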
#### Change

We can model the configuration of a model template for different training "modes" with different model policies. For example, if a model is to engage in the policy `QUICK_TRAIN`, it is to cut its training short, e.g. by either doing early stopping or reducing the number of epochs. The model communicates to Rafiki which policies it supports by adding `PolicyKnob(policy_name)` to its `knob_config`. On the other hand, Rafiki configures the activation of the model's policies on a per-trial basis by realising the values of `PolicyKnob`s to either `True` (activated) or `False` (not activated).

For example, here is an example knob config of a model that supports the policy `QUICK_TRAIN`:
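A minimal sketch of such a knob config follows. The knob class names follow Rafiki's model API as described in this issue, but the stand-in definitions below are mine (so the snippet runs on its own) and the real signatures may differ.

```python
from collections import namedtuple

# Stand-ins so this sketch is self-contained; the real FloatKnob and
# PolicyKnob live in Rafiki's model API and their signatures may differ.
FloatKnob = namedtuple('FloatKnob', ['min_val', 'max_val'])
PolicyKnob = namedtuple('PolicyKnob', ['policy'])

def get_knob_config():
    return {
        'learning_rate': FloatKnob(1e-4, 1e-1),  # an ordinary tuning knob
        # Declares that this model supports the QUICK_TRAIN policy; Rafiki
        # realises it to True or False on a per-trial basis.
        'quick_train': PolicyKnob('QUICK_TRAIN'),
    }
```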
Whenever the model is to do early stopping, Rafiki will pass `quick_train=True` as part of the model's knobs. Otherwise, the model defaults to full-length training.

Here is my current documentation for `PolicyKnob`:
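The documentation itself is not reproduced in this excerpt; paraphrasing the behaviour described above, it would read along these lines (my sketch, not the author's actual text):

```python
# Paraphrased sketch of PolicyKnob's behaviour as described in this issue;
# not the actual class from Rafiki's codebase.

class PolicyKnob:
    """Knob whose value is realised by Rafiki to either True or False.

    A model declares that it supports a policy (e.g. 'QUICK_TRAIN') by
    adding PolicyKnob(policy_name) to its knob_config; Rafiki then
    activates or deactivates that policy on a per-trial basis by passing
    True or False as this knob's value.
    """

    def __init__(self, policy: str):
        self.policy = policy
```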