nginyc / rafiki

Rafiki is a distributed system that supports training and deployment of machine learning models using AutoML, built with ease-of-use in mind.
Apache License 2.0

Model developers to tune architecture #50

Closed nginyc closed 5 years ago

nginyc commented 6 years ago

With "Efficient Neural Architecture Search via Parameter Sharing"

Planned major changes

To better support architecture tuning with ENAS, I'm planning changes to Rafiki's current model training framework:

Replacing budget option MODEL_TRIAL_COUNT with TIME_HOURS

Context

Currently, when application developers create model training jobs, they pass a budget like { 'GPU_COUNT': 1, 'MODEL_TRIAL_COUNT': 20 }, with MODEL_TRIAL_COUNT deciding the no. of trials to conduct for each model template.

Change

Replace the MODEL_TRIAL_COUNT option with a TIME_HOURS option, which specifies how long the train job should run. It is a soft time target. At the same time, I'll be reworking the Advisor component (which proposes trials' knobs) so that it is additionally in charge of deciding how many trials to run, when to stop each worker, and when to stop the train job, given the budget (e.g. GPU_COUNT and TIME_HOURS).
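To illustrate, here is a minimal sketch of how the Advisor might treat TIME_HOURS as a soft target (the budget keys are from this proposal; the `should_stop` helper is hypothetical, not Rafiki's actual API):

```python
import time

# Proposed budget format: a soft time target instead of a fixed trial count
old_budget = {'GPU_COUNT': 1, 'MODEL_TRIAL_COUNT': 20}
new_budget = {'GPU_COUNT': 1, 'TIME_HOURS': 12}

def should_stop(start_time, budget):
    # Stop proposing new trials once the soft time target has elapsed.
    # Running trials may still finish, which is why the target is "soft".
    elapsed_hours = (time.time() - start_time) / 3600
    return elapsed_hours >= budget.get('TIME_HOURS', float('inf'))
```

Under this scheme the Advisor, not the application developer, decides the number of trials that fit in the budget.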

Reasons for change

Introducing PolicyKnob

Motivation

I have been integrating ENAS as a new model tuning strategy in Rafiki (i.e. at the Advisor component). If a model template wants to do architecture tuning with ENAS, the model's training code needs to switch between different "modes".

Similarly, when you think about a standard hyperparameter tuning procedure, you might want the model to do early-stopping for the first, say, 100 trials, then conduct a final full-length trial of, say, 300 epochs.

In both architecture tuning & hyperparameter tuning, the model needs to be configured by Rafiki somehow to switch between these "modes" on a per-trial basis.

Change

We can model the configuration of a model template for different training "modes" with different model policies. For example, if a model is to engage in the policy QUICK_TRAIN, it should speed up its training, e.g. by either doing early-stopping or reducing the no. of epochs. The model communicates to Rafiki which policies it supports by adding PolicyKnob(policy_name) to its knob_config. In turn, Rafiki configures the activation of the model's policies on a per-trial basis by realising the value of each PolicyKnob to either True (activated) or False (not activated).
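A minimal sketch of this mechanism, using simplified stand-ins for Rafiki's knob classes (the real ones live in `rafiki.model`; the helper `supported_policies` is hypothetical):

```python
# Simplified stand-ins for Rafiki's knob types, for illustration only
class PolicyKnob:
    def __init__(self, policy_name):
        self.policy_name = policy_name

class FixedKnob:
    def __init__(self, value):
        self.value = value

def knob_config():
    return {
        'max_epochs': FixedKnob(300),
        # The model declares that it supports the QUICK_TRAIN policy
        'quick_train': PolicyKnob('QUICK_TRAIN'),
    }

def supported_policies(config):
    # Rafiki can inspect the knob config to learn which policies
    # a model template supports
    return {k.policy_name for k in config.values() if isinstance(k, PolicyKnob)}
```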

For example, here is an example knob config of a model that supports the policy QUICK_TRAIN:

(image: example knob config containing a `PolicyKnob('QUICK_TRAIN')`)

Whenever the model is to do early-stopping, Rafiki will pass quick_train=True as part of the model's knobs. Otherwise, the model defaults to full-length training.
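On the model side, this contract could look like the following sketch (the epoch counts and the `train` signature are assumptions, not Rafiki's actual API):

```python
def train(knobs):
    # If Rafiki activated QUICK_TRAIN for this trial, cut training short;
    # otherwise, default to full-length training.
    epochs = 100 if knobs.get('quick_train', False) else 300
    return epochs
```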

Here is my current documentation for PolicyKnob:

'''
    Knob type representing whether a certain policy should be activated, as a boolean.
    E.g. the `QUICK_TRAIN` policy knob decides whether the model should stop model training early, or not.
    Offering the ability to activate different policies can optimize hyperparameter search for your model.
    All policies default to not activated (``False``).

    ===================     ============================================================================================================
    **Policy**              **Description**
    ===================     ============================================================================================================
    ``SHARE_PARAMS``        Whether the model supports parameter sharing
    ``QUICK_TRAIN``         Whether the model should stop training early in ``train()``, e.g. with early stopping or fewer epochs
    ``SKIP_TRAIN``          Whether the model should skip training its parameters
    ``QUICK_EVAL``          Whether the model should stop evaluation early in ``evaluate()``, e.g. by evaluating on only a subset of the validation dataset
    ``DOWNSCALE``           Whether a smaller version of the model should be constructed, e.g. with fewer layers
    ===================     ============================================================================================================

'''
nginyc commented 5 years ago

@nudles I have added some details & reasoning on the major changes I'm going to make for architecture tuning. Let me know if you have any comments & advice on them!

nudles commented 5 years ago

In terms of budget, can we let users configure either hours or trials? Rename QUICK_TRAIN to EARLY_STOP? When will DOWNSCALE be used?

nginyc commented 5 years ago
  1. Okay, I will keep MODEL_TRIAL_COUNT for backward compatibility as well.
  2. Noted on the suggestion.
  3. DOWNSCALE is used in architecture search for both NAS and ENAS. During the architecture search phase, the model constructed has fewer layers (e.g. 6 layers) to speed up the search, and its performance serves as a rough proxy for actual performance. At the final train phase, the final model trained from scratch is full-sized (e.g. 15 layers), which is the one that will give the best performance.
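The two phases described above could be sketched like this (layer counts taken from the example in this comment; the helper is hypothetical):

```python
def num_layers(knobs):
    # Architecture search phase: DOWNSCALE is activated, so build a shallow
    # proxy model to speed up the search. Final train phase: DOWNSCALE is
    # not activated, so build the full-sized model from scratch.
    return 6 if knobs.get('downscale', False) else 15
```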