nginyc / rafiki

Rafiki is a distributed system that supports training and deployment of machine learning models using AutoML, built with ease-of-use in mind.
Apache License 2.0

Model developers to tune architecture #50

Closed nginyc closed 5 years ago

nginyc commented 6 years ago

With "Efficient Neural Architecture Search via Parameter Sharing"

Planned major changes

To better support architecture tuning with ENAS, I'm planning changes to Rafiki's current model training framework:

Replacing budget option MODEL_TRIAL_COUNT with TIME_HOURS

Context

Currently, when application developers create model training jobs, they pass a budget like { 'GPU_COUNT': 1, 'MODEL_TRIAL_COUNT': 20 }, with MODEL_TRIAL_COUNT deciding the no. of trials to conduct for each model template.

Change

Replace the MODEL_TRIAL_COUNT option with a TIME_HOURS option, which specifies how long the train job should run. It is a soft time target. At the same time, I'll be reworking the Advisor component (which proposes trials' knobs) so that it is additionally in charge of deciding how many trials to run, when to stop each worker, and when to stop the train job, given the budget (e.g. GPU_COUNT and TIME_HOURS).
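To illustrate, here is a minimal sketch of how the Advisor might treat TIME_HOURS as a soft target (the budget keys are from this proposal; the `should_stop` helper is hypothetical, not Rafiki's actual API):

```python
import time

# Proposed budget format: a soft time target instead of a fixed trial count
old_budget = {'GPU_COUNT': 1, 'MODEL_TRIAL_COUNT': 20}
new_budget = {'GPU_COUNT': 1, 'TIME_HOURS': 12}

def should_stop(start_time, budget):
    # Stop proposing new trials once the soft time target has elapsed.
    # Running trials may still finish, which is why the target is "soft".
    elapsed_hours = (time.time() - start_time) / 3600
    return elapsed_hours >= budget.get('TIME_HOURS', float('inf'))
```

Under this scheme the Advisor, not the application developer, decides the number of trials that fit in the budget.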

Reasons for change

Introducing PolicyKnob

Motivation

I have been integrating ENAS as a new model tuning strategy in Rafiki (i.e. at the Advisor component). If a model template wants to do architecture tuning with ENAS, the model's training code needs to switch between different "modes".

Similarly, when you think about a standard hyperparameter tuning procedure, you might want the model to do early-stopping for the first, say, 100 trials, then conduct a final full-length trial of, say, 300 epochs.

In both architecture tuning & hyperparameter tuning, the model needs to be configured by Rafiki somehow to switch between these "modes" on a per-trial basis.

Change

We can model the configuration of a model template for different training "modes" with different model policies. For example, if a model is to engage in the policy QUICK_TRAIN, it should speed up its training, e.g. by either doing early-stopping or reducing the no. of epochs. The model communicates to Rafiki which policies it supports by adding PolicyKnob(policy_name) to its knob_config. In turn, Rafiki configures the activation of the model's policies on a per-trial basis by realising the value of each PolicyKnob to either True (activated) or False (not activated).
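A minimal sketch of this mechanism, using simplified stand-ins for Rafiki's knob classes (the real ones live in `rafiki.model`; the helper `supported_policies` is hypothetical):

```python
# Simplified stand-ins for Rafiki's knob types, for illustration only
class PolicyKnob:
    def __init__(self, policy_name):
        self.policy_name = policy_name

class FixedKnob:
    def __init__(self, value):
        self.value = value

def knob_config():
    return {
        'max_epochs': FixedKnob(300),
        # The model declares that it supports the QUICK_TRAIN policy
        'quick_train': PolicyKnob('QUICK_TRAIN'),
    }

def supported_policies(config):
    # Rafiki can inspect the knob config to learn which policies
    # a model template supports
    return {k.policy_name for k in config.values() if isinstance(k, PolicyKnob)}
```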

For example, here is an example knob config of a model that supports the policy QUICK_TRAIN:

(image: example knob config containing a `PolicyKnob('QUICK_TRAIN')`)

Whenever the model is to do early-stopping, Rafiki will pass quick_train=True as part of the model's knobs. Otherwise, the model defaults to full-length training.
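On the model side, this contract could look like the following sketch (the epoch counts and the `train` signature are assumptions, not Rafiki's actual API):

```python
def train(knobs):
    # If Rafiki activated QUICK_TRAIN for this trial, cut training short;
    # otherwise, default to full-length training.
    epochs = 100 if knobs.get('quick_train', False) else 300
    return epochs
```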

Here is my current documentation for PolicyKnob:

'''
    Knob type representing whether a certain policy should be activated, as a boolean.
    E.g. the `QUICK_TRAIN` policy knob decides whether the model should stop model training early, or not.
    Offering the ability to activate different policies can optimize hyperparameter search for your model.
    All policies default to not activated (``False``).

    ===================     ============================================================================================================
    **Policy**              **Description**
    ===================     ============================================================================================================
    ``SHARE_PARAMS``        Whether the model supports parameter sharing
    ``QUICK_TRAIN``         Whether the model should stop training early in ``train()``, e.g. with early stopping or fewer epochs
    ``SKIP_TRAIN``          Whether the model should skip training its parameters
    ``QUICK_EVAL``          Whether the model should stop evaluation early in ``evaluate()``, e.g. by evaluating on only a subset of the validation dataset
    ``DOWNSCALE``           Whether a smaller version of the model should be constructed, e.g. with fewer layers
    ===================     ============================================================================================================

'''
nginyc commented 5 years ago

@nudles I have added some details & reasoning on the major changes I'm going to make for architecture tuning. Let me know if you have any comments & advice on them!

nudles commented 5 years ago

In terms of budget, can we let users configure either hours or trials? Rename QUICK_TRAIN to EARLY_STOP? When will DOWNSCALE be used?

nginyc commented 5 years ago
  1. Okay, I will keep MODEL_TRIAL_COUNT for backward compatibility as well.
  2. Noted on the suggestion.
  3. DOWNSCALE is used in architecture search for both NAS and ENAS. During the architecture search phase, the model constructed has fewer layers (e.g. 6 layers) to speed up the search, and its performance serves as a rough proxy for actual performance. At the final train phase, the final model trained from scratch is full-sized (e.g. 15 layers), which is the one that will give the best performance.
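The two phases described above could be sketched like this (layer counts taken from the example in this comment; the helper is hypothetical):

```python
def num_layers(knobs):
    # Architecture search phase: DOWNSCALE is activated, so build a shallow
    # proxy model to speed up the search. Final train phase: DOWNSCALE is
    # not activated, so build the full-sized model from scratch.
    return 6 if knobs.get('downscale', False) else 15
```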