As our project is similar to NNI by Microsoft, I thought it might be good to study how they're doing things, compare it with how we're doing things, and derive some learnings.
How model's hyperparameter search space is defined
NNI calls a set of knob values a "Configuration", and the knob config a "Search Space"
In NNI, the "Search Space" is defined as JSON in a separate file
In Rafiki, the Knob Config is defined in Python with typed "Knob" classes as part of the model code (see the side-by-side example at the end of this section)
My opinion:
Rename "knob config" -> "knob space" for clarity?
More flexible & powerful to configure dynamically with Python
Edits to search space are simpler if written in Python alongside model code in the same file
Submitting a separate configuration file can be more troublesome
Using JSON is more convenient if hyperparameter search space is to be tweaked often, independently of model code
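For a side-by-side feel, the two styles look roughly like this (exact NNI field names and our knob class names/arguments may differ slightly from what's shown):
E.g.
# NNI: search space lives in a separate search_space.json file, e.g.
# { "lr": { "_type": "loguniform", "_value": [0.0001, 0.1] },
#   "batch_size": { "_type": "choice", "_value": [16, 32, 64] } }

# Rafiki: knob config lives in the model file itself, as typed Knob classes
from rafiki.model import BaseModel, FloatKnob, CategoricalKnob

class MyModel(BaseModel):
    @staticmethod
    def get_knob_config():
        return {
            'lr': FloatKnob(1e-4, 1e-1, is_exp=True),    # float sampled on a log scale
            'batch_size': CategoricalKnob([16, 32, 64]),
        }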
How model developers configure the AutoML algorithm
In NNI, model developers need to configure an "Experiment" in a YAML file
Choices in this configuration: which "Tuner" to use, the configuration for that tuner, max no. of trials, max training duration, no. of GPUs, and the platform to train on (e.g. local machine, Kubernetes)
A single model is trained for each experiment
In NNI, pointers to datasets are hard-coded in model code, and there's no concept of "task"
In Rafiki, application developers configure a train job by simply submitting a task, a budget, datasets and optionally model IDs in Python
Rafiki matches task to a set of models, and trains these multiple models concurrently
Rafiki manages provisioning of training platform & GPUs
Rafiki automatically selects & configures which advisor to use based on hyperparameter search space
Due to differences between the designs of Rafiki and NNI:
In Rafiki, a non-expert application developer initiates training instead of a model developer, so configuring training should be non-technical and should, as much as possible, abstract away the complexity of model selection & tuning configuration
Rafiki is designed to be more end-to-end, as an ML-as-a-service
My opinion:
As with NNI, should model developers in Rafiki be able to optionally configure how their models are tuned, e.g. which advisor to use and how that advisor is configured?
This allows model developers to select more appropriate / empirically better AutoML algorithms for their models, but places more burden on them
Maybe exposed via another static class method on the model (a rough sketch follows at the end of this section)
Current abstraction & definition of budget in Rafiki is appropriate
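If we do expose this, one possible shape is sketched below (the method name, advisor identifier and config keys are all made up for illustration; Rafiki would fall back to automatic advisor selection when the method is absent):
E.g.
from rafiki.model import BaseModel

class MyModel(BaseModel):
    @staticmethod
    def get_knob_config():
        ...  # existing knob config (search space), as before

    @staticmethod
    def get_tuning_config():  # hypothetical additional static method
        # Optional hint on which advisor to use for this model and how to configure it
        return {
            'advisor': 'bayesian_optimization',            # illustrative advisor identifier
            'advisor_config': {'num_initial_points': 10},  # illustrative advisor config
        }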
How the model interfaces with the AutoML system
In NNI, the AutoML system calls upon model code by simply running the main Python file (i.e. it triggers the main method). A directory of Python files is supported
In Rafiki, the system calls upon model code by importing a given class from a single Python file, then appropriately running methods on instances of that class
In NNI, model code calls upon the AutoML system by importing the nni module and calling e.g. nni.get_next_parameter() to get the hyperparameters for a trial, and nni.report_final_result(metrics) to pass back a trial's final metrics, which are interpreted by the tuner
In Rafiki, model code imports the utils module and calls e.g. utils.dataset..., utils.logger... for dataset/logging helpers, while return values of methods such as evaluate(dataset) pass the final score back to the system (see the side-by-side sketch at the end of this section)
My opinion:
NNI's interface maximises portability of existing model code - no need to rewrite into a class definition like in Rafiki
NNI's interface more loosely couples model code & AutoML system
But Rafiki's well-defined model class gives more flexibility/power to tuning algorithms (e.g. better control flow), and is more appropriate for our design
Unlike NNI, Rafiki needs to support predictions, loading & saving of model parameters
Consider documenting how to port existing model code to Rafiki, or brainstorming tweaks to the API to improve portability?
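For concreteness, a side-by-side sketch of the two interfaces (the NNI calls are the ones named above; the Rafiki class shape is indicative only, with parameter-saving/loading methods omitted):
E.g.
# NNI-style trial script: the system runs this file, and the script pulls its own hyperparameters
import nni

def run_existing_training_code(lr, batch_size):
    # placeholder for the developer's existing training & evaluation code
    return 0.0

def main():
    params = nni.get_next_parameter()            # e.g. {'lr': 0.01, 'batch_size': 32}
    acc = run_existing_training_code(**params)   # existing code, largely unchanged
    nni.report_final_result(acc)                 # final metric, interpreted by the tuner

# Rafiki-style model: the system imports this class and drives the control flow itself
from rafiki.model import BaseModel

class MyModel(BaseModel):
    def train(self, dataset):      # system calls this with the training dataset
        pass                       # ...training code, using utils.dataset / utils.logger helpers
    def evaluate(self, dataset):   # return value is the score passed back to the advisor
        return 0.0
    def predict(self, queries):    # needed for serving predictions, unlike NNI
        return []
    # ...plus methods for saving & loading model parameters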
How the AutoML system configures the model's training behaviour
NNI configures a model's training behaviour only through hyperparameters and its early-stopping framework built around the concept of an "Assessor"
Model code can optionally call nni.report_intermediate_result(metrics); the assessor interprets these intermediate results and kills the trial when they are poor
There is no explicit support or extension point for other ways to configure a model's training behaviour beyond early stopping, e.g. loading of shared parameters or using a downscaled model
In Rafiki, we're thinking of configuring a model's training behaviour with PolicyKnob(policy_name) as part of the model's knobs, so that model code can switch between different "modes" (e.g. early stop VS don't early stop) (see the sketch below)
My opinion: this can support more advanced tuning strategies e.g. we can introduce more policies in the future without changing Rafiki's code
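A rough sketch of how this could look from the model developer's side (the policy name, knob names and the way knobs are accessed are illustrative assumptions):
E.g.
from rafiki.model import BaseModel, FloatKnob, PolicyKnob

class MyModel(BaseModel):
    @staticmethod
    def get_knob_config():
        return {
            'lr': FloatKnob(1e-4, 1e-1),
            # Resolves to True only for trials where the advisor activates the 'EARLY_STOP' policy
            'quick_train': PolicyKnob('EARLY_STOP'),
        }

    def __init__(self, **knobs):
        super().__init__(**knobs)
        self._knobs = knobs

    def train(self, dataset):
        # The model switches "modes" based on the policy, without the advisor knowing model internals
        epochs = 1 if self._knobs['quick_train'] else 100
        ...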
How the AutoML system supports architecture tuning
NNI currently only supports GA-based architecture tuning, where the model code & tuner depend on a custom graph abstraction & architecture space definition as the "Search Space"
Which may not be general/flexible enough
As in the previous section, NNI won't be able to formally support an implementation of ENAS, as the tuner needs to tell the model code to load shared parameters and to switch between "train for 1 epoch" & "just evaluate on a subset of the validation dataset"
For Rafiki, we're thinking of representing architecture as an array of categorical values
More general (up to model developer to define encoding), but low-level and less "informative" for the architecture tuning algorithm
E.g.
l0 = KnobValue(0) # Input layer as input connection
l1 = KnobValue(1) # Layer 1 as input connection
l2 = KnobValue(2) # Layer 2 as input connection
ops = [KnobValue('conv3x3'), KnobValue('conv5x5'), KnobValue('avg_pool'), KnobValue('max_pool')]
arch_knob = ArchKnob([
[l0], ops, [l0], ops, # To form layer 1, choose input 1, op on input 1, input 2, op on input 2, then combine post-op inputs as preferred
[l0, l1], ops, [l0, l1], ops, # To form layer 2, ...
[l0, l1, l2], ops, [l0, l1, l2], ops, # To form layer 3, ...
])
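A sampled value of arch_knob would then be a flat list of the chosen values, which the model code is free to decode however it likes; a minimal, purely illustrative decoder:
E.g.
def decode_arch(arch):
    # arch is a flat list of sampled values, in groups of 4 per layer:
    # [input_1, op_1, input_2, op_2], e.g.
    # [0, 'conv3x3', 0, 'avg_pool',  0, 'conv5x5', 1, 'max_pool',  1, 'conv3x3', 2, 'avg_pool']
    layers = ['input']  # index 0 refers to the input layer
    for i in range(0, len(arch), 4):
        in1, op1, in2, op2 = arch[i:i + 4]
        # apply op_1 to the first chosen input, op_2 to the second, then combine them
        layers.append(('combine', (op1, layers[in1]), (op2, layers[in2])))
    return layers[-1]  # output of the last layer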
Thanks for the comparison. I list some comments (not in order)
Search space: NNI has another way of defining the search space, which uses annotations. It moves the hyper-parameter definition closer to where it is used, and the Python code can run both with and without hyper-parameter tuning (a rough sketch is included after these comments).
By making NNI a library, it is easier for local development and debugging, and the running flow is controlled by model developers. Rafiki provides a platform for hyper-parameter search, and Rafiki controls the flow, much like MapReduce, where the system controls the flow and the developers fill in the code for map and reduce.
It would be good to decouple the system into modular components: resource management, filesystem or datastore, hyper-parameter tuning, inference queueing, etc.
We may not be able to unify architecture tuning algorithms and hyper-parameter tuning algorithms; e.g., it is difficult to even unify the ENAS and DARTS algorithms.
MLflow and Kubeflow are two other projects with a hyper-parameter tuning feature.
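For reference, the annotation-based search space mentioned in the first comment looks roughly like this (syntax recalled from NNI's docs and may differ slightly); the annotations are plain string literals, so the script still runs unchanged when NNI is not in the loop:
E.g.
"""@nni.variable(nni.choice(0.1, 0.01, 0.001), name=learning_rate)"""
learning_rate = 0.1  # default used when running without NNI

"""@nni.variable(nni.choice(16, 32, 64), name=batch_size)"""
batch_size = 32      # default used when running without NNI

# ...training code uses learning_rate & batch_size as usual...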