This TL;DR aims to give myself a quick view of the neural architecture search literature. Hope it helps you too 😁!
Designing a neural network for a task of interest often requires significant architecture engineering. Neural Architecture Search (NAS/AutoML) aims to automatically find a neural network architecture that achieves good or even state-of-the-art performance on the task of interest. Most NAS work focuses on designing the search space (e.g., which activation functions or cell operations to search over). Two main categories of NAS: evolutionary (ES) algorithms and reinforcement learning (RL) algorithms. Here we mainly overview RL-based methods.
Uses an RNN controller to output a variable-length sequence of tokens (i.e., the search space) that specifies a neural network.
Such a neural network is then built, trained, and evaluated to yield an accuracy on the validation set.
Train the RNN controller with this validation accuracy as the reward using policy gradient, which encourages the searched network to generalize well instead of overfitting the training set. (This is essentially meta-learning; see the REINFORCE sketch after this list.)
Also modifies the RNN controller to output recursively, in order to search the operations for the hidden states inside an RNN/LSTM cell.
Trained on 800 GPUs and searched ~15000 architectures over 28 days.
Evaluated on CIFAR-10 (3.65% error rate) and PTB dataset (word: 62.4 ppl; char: 1.214 ppl).
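A minimal PyTorch sketch (not the paper's code) of the REINFORCE-style controller update described above; `build_and_evaluate` is a hypothetical placeholder for building, training, and validating the child network:

```python
import torch
import torch.nn as nn

# Toy search space: each token picks one of a few ops per layer.
OPS = ["conv3x3", "conv5x5", "maxpool", "identity"]
NUM_LAYERS = 6

class Controller(nn.Module):
    """RNN controller that emits one op token per layer (toy version of [1])."""
    def __init__(self, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(len(OPS) + 1, hidden)   # +1 for a start token
        self.rnn = nn.LSTMCell(hidden, hidden)
        self.head = nn.Linear(hidden, len(OPS))

    def sample(self):
        h = c = torch.zeros(1, self.head.in_features)
        tok = torch.tensor([len(OPS)])                    # start token
        tokens, log_probs = [], []
        for _ in range(NUM_LAYERS):
            h, c = self.rnn(self.embed(tok), (h, c))
            dist = torch.distributions.Categorical(logits=self.head(h))
            tok = dist.sample()
            tokens.append(tok.item())
            log_probs.append(dist.log_prob(tok))
        return tokens, torch.stack(log_probs).sum()

def build_and_evaluate(tokens):
    # Placeholder: build the child network from the tokens, train it, and
    # return its validation accuracy. Faked here to keep the sketch short.
    return torch.rand(1).item()

controller = Controller()
opt = torch.optim.Adam(controller.parameters(), lr=3e-4)
baseline = 0.0                                            # moving-average baseline

for step in range(1000):
    tokens, log_prob = controller.sample()
    reward = build_and_evaluate(tokens)                   # validation accuracy as reward
    baseline = 0.95 * baseline + 0.05 * reward            # variance reduction
    loss = -(reward - baseline) * log_prob                # REINFORCE objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```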
Addresses the large computational cost of searching directly on a large dataset (e.g., ImageNet): modify the search space to search for the best cell structure instead of an entire network structure, which may also generalize better to other tasks (this is called the NASNet search space).
Similar to the RNN/LSTM cell search in [1], the controller first predicts which two input hidden states to use and then predicts what operations to apply to them (a rough sketch follows this list).
The number of stacked cells is treated as a hyperparameter tailored to the scale of the network.
They first search on CIFAR-10 (called NASNets) and transfer to ImageNet without much modification.
The optimization is the same as NAS [1] but replaces vanilla policy gradient with Proximal Policy Optimization (PPO).
Trained on 450 GPUs over 4 days, which is 7x faster than [1].
Evaluated on CIFAR-10 (2.4% error rate) and ImageNet (82.7% top-1 acc., matching SENet).
The image features learned by NASNets are generally useful and transfer to other computer vision tasks (e.g., object detection achieves 43.1% mAP on COCO).
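A rough sketch of how one block inside a NASNet-style cell is specified (random sampling stands in for the RL controller; op names are illustrative):

```python
import random

CANDIDATE_OPS = ["sep_conv_3x3", "sep_conv_5x5", "avg_pool_3x3",
                 "max_pool_3x3", "identity"]
COMBINE = ["add", "concat"]

def sample_cell(num_blocks=5):
    """Each block picks 2 existing hidden states, 2 ops, and a combine method."""
    hidden_states = ["h_prev", "h_prev_prev"]       # the two cell inputs
    cell = []
    for b in range(num_blocks):
        inp1, inp2 = random.choice(hidden_states), random.choice(hidden_states)
        op1, op2 = random.choice(CANDIDATE_OPS), random.choice(CANDIDATE_OPS)
        comb = random.choice(COMBINE)
        cell.append((inp1, op1, inp2, op2, comb))
        hidden_states.append(f"block_{b}")          # new state usable by later blocks
    return cell

print(sample_cell())
```

The searched cell is then stacked N times to form the full network, where N is the scaling hyperparameter mentioned above.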
Instead of searching and training a network from scratch repeatedly, they propose to search for a network transformation operation (Net2Net) to apply to the currently searched network.
Net2Net operations: Net2Wider (replace a layer with a wider layer) & Net2Deeper (insert a new layer); a function-preserving widening sketch follows this list.
The bi-LSTM controller takes the current architecture as input and outputs which Net2Net operation to apply.
Trained on 5 GPUs in less than 2 days.
Evaluated on CIFAR-10+ (4.23% error rate) and SVHN (1.83% error rate).
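A minimal NumPy sketch of the Net2Wider idea (function-preserving widening of a hidden layer; not the paper's code):

```python
import numpy as np

def net2wider(W1, b1, W2, new_width):
    """Widen a hidden layer from n to new_width units without changing the function.

    W1: (n, in)  weights of the layer being widened, b1: (n,) its bias,
    W2: (out, n) weights of the next layer.
    """
    n = W1.shape[0]
    # Mapping g: new unit j copies old unit g(j); the first n units map to themselves.
    g = np.concatenate([np.arange(n), np.random.randint(0, n, new_width - n)])
    counts = np.bincount(g, minlength=n)            # number of copies of each old unit
    W1_new, b1_new = W1[g], b1[g]                   # replicate rows of the widened layer
    W2_new = W2[:, g] / counts[g]                   # split outgoing weights to preserve outputs
    return W1_new, b1_new, W2_new

# Sanity check: the widened network computes the same function.
x = np.random.randn(8, 4)
W1, b1, W2 = np.random.randn(5, 4), np.random.randn(5), np.random.randn(3, 5)
y_old = np.maximum(x @ W1.T + b1, 0) @ W2.T
W1w, b1w, W2w = net2wider(W1, b1, W2, new_width=9)
y_new = np.maximum(x @ W1w.T + b1w, 0) @ W2w.T
print(np.allclose(y_old, y_new))                    # True
```

Net2Deeper works similarly by inserting a new layer initialized to the identity mapping.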
Built on NASNet [2], they propose a sequential model-based optimization (SMBO) strategy (progressing from training simple networks to complex ones) that is 5x more sample-efficient and 8x computationally faster than [2]. (However, they empirically reduce the search space compared to NASNet [2].)
Begin by training all 1-block cells; there are only 136 unique combinations.
Pick the K most promising cells, expand them into 2-block cells, and iterate (up to 5-block cells). However, the 2-block cell structures already have ~10^5 combinations.
So a surrogate model (MLP or LSTM) is trained to predict the final performance of a cell structure, which helps pick promising 2-block cell structures (see the SMBO sketch after this list).
Evaluated on CIFAR-10 (3.41% test error) and ImageNet (82.9% top-1 acc).
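A schematic of that SMBO loop (simplified to one input and one op per block; `train_and_eval` and `surrogate_predict` are placeholders for the expensive training step and the learned predictor):

```python
import random

OPS = ["sep3x3", "sep5x5", "maxpool", "identity"]

def expand(cell):
    """All one-block expansions of a cell; a block here is (input_index, op)."""
    next_input = len(cell) + 2                     # 2 cell inputs + previous blocks
    return [cell + [(i, op)] for i in range(next_input) for op in OPS]

def train_and_eval(cell):
    # Placeholder for the expensive step: build, train, and validate the cell.
    return random.random()

def surrogate_predict(cell):
    # Placeholder for the surrogate (MLP/LSTM) fit on (cell, accuracy) pairs.
    return random.random()

K, MAX_BLOCKS = 8, 5
beam, history = [[]], []                           # start from the empty cell
for num_blocks in range(1, MAX_BLOCKS + 1):
    candidates = [c for cell in beam for c in expand(cell)]
    if num_blocks == 1:
        scored = [(train_and_eval(c), c) for c in candidates]   # exhaustive at 1 block
    else:
        ranked = sorted(candidates, key=surrogate_predict, reverse=True)[:K]
        scored = [(train_and_eval(c), c) for c in ranked]
    history.extend(scored)                         # re-fit the surrogate on history here
    beam = [c for _, c in sorted(scored, key=lambda t: t[0], reverse=True)[:K]]
```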
Searches for an optimal subgraph within one large computational graph (a whole neural network representing the search space).
ENAS allows parameters to be shared by representing NAS's search space as a single directed acyclic graph (DAG), where an architecture is realized by taking a subgraph of the DAG (node = local computation; edge = information flow; each pair of nodes has a weight parameter matrix); see the weight-sharing sketch after this list.
The ENAS controller decides: 1) which edges are activated; 2) which computation is performed at each node. This lets ENAS design both the topology and the operations in an RNN cell, unlike NAS [1].
Alternating training: 1) train the shared weights by sampling K networks from the ENAS controller per minibatch (K = 1 works); 2) after an entire epoch, train the ENAS controller with policy gradient as in NAS [1].
Final architecture selection: sample M networks, evaluate each on a batch of the validation set, select the best-performing one, and retrain it from scratch.
Trained on 1 GPU in less than 16 hours.
Evaluated on CIFAR-10 (2.89% error rate) and PTB (word: 55.8 ppl).
Slightly worse than NAS [1], since NAS [1] explores more (different) topologies, whereas ENAS searches within a predefined topology.
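A toy PyTorch sketch of the weight-sharing idea (not the authors' code): every possible edge of the DAG owns its own parameters inside one shared module, and a sampled architecture only selects which of them to use:

```python
import random
import torch
import torch.nn as nn

OPS = {"tanh": torch.tanh, "relu": torch.relu, "identity": lambda x: x}

class SharedDAG(nn.Module):
    def __init__(self, num_nodes=4, dim=32):
        super().__init__()
        self.num_nodes = num_nodes
        # One shared linear transform per possible (from_node -> to_node) edge.
        self.edges = nn.ModuleDict({
            f"e{i}_{j}": nn.Linear(dim, dim)
            for j in range(1, num_nodes) for i in range(j)
        })

    def forward(self, x, arch):
        """arch[j] = (input_node, op_name) chosen by the controller for node j."""
        states = [x]
        for j in range(1, self.num_nodes):
            i, op = arch[j]
            states.append(OPS[op](self.edges[f"e{i}_{j}"](states[i])))
        return states[-1]

def sample_arch(num_nodes=4):
    # Stand-in for the ENAS controller: pick a previous node and an op per node.
    return {j: (random.randrange(j), random.choice(list(OPS))) for j in range(1, num_nodes)}

dag = SharedDAG()
x = torch.randn(2, 32)
for _ in range(3):                                  # different subgraphs, same parameters
    print(dag(x, sample_arch()).shape)
```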
Aims to understand why a heterogeneous set of architectures is able to share a single set of weights as in ENAS [5] (i.e., the one-shot model).
Shows that an RL controller is not needed to find a good subnetwork in the one-shot model.
It is possible to predict an architecture’s validation set accuracy by looking at its behavior on unlabeled examples from the training set.
Method:
Define the one-shot model (i.e., the search space);
Train the one-shot model. To ensure that the one-shot model's accuracy correlates well with that of its subgraphs (architectures):
Use linearly scheduled path dropout (i.e., randomly zero out sub-ops for each batch of examples); a sketch follows this list.
Within a single cell, different ops are dropped out independently; when the cell is repeated, the same ops are dropped out in every copy.
Use ghost batch norm to stabilize training (BN-ReLU-Conv) and prevent path dropout from changing the batch statistics: since dropping out paths independently for every example in a batch does not work, partition each training batch into multiple ghost batches and drop out the same paths within each ghost batch.
Apply the L2 penalty only to the parts of the one-shot model that are used by the current architecture, so that frequently dropped-out paths are not regularized more than others.
Evaluate architectures sampled randomly and independently from the trained one-shot model on the validation set. The random search can be replaced by ES or RL.
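A rough PyTorch sketch of scheduled path dropout on one edge of the one-shot model (ghost batch norm omitted for brevity):

```python
import torch
import torch.nn as nn

class OneShotEdge(nn.Module):
    """Sums several candidate ops; each op is randomly dropped during training."""
    def __init__(self, channels):
        super().__init__()
        self.paths = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.AvgPool2d(3, stride=1, padding=1),
        ])

    def forward(self, x, drop_rate):
        # Drop each path independently; always keep at least one path alive.
        keep = torch.rand(len(self.paths)) > drop_rate
        if not keep.any():
            keep[torch.randint(len(self.paths), (1,))] = True
        return sum(p(x) for p, k in zip(self.paths, keep) if k)

edge = OneShotEdge(channels=16)
x = torch.randn(4, 16, 8, 8)
total_steps = 1000
for step in range(3):                               # training loop (truncated)
    drop_rate = 0.5 * step / total_steps            # linearly increasing schedule
    out = edge(x, drop_rate)
```

At evaluation time, a sampled architecture keeps only its chosen paths on each edge and zeroes out the rest.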
Optimizes the architecture search in a continuous space (using only gradient descent), instead of the above approaches that optimize in a discrete space with RL.
Search space: cell topology (like ENAS).
Assume the conv or RNN cell has 2 input nodes (features) and a single output node.
Each intermediate node is obtained by applying operations (conv, pool, activations, etc.) along its incoming edges.
Relax the categorical choice of a particular operation into a softmax over all possible operations.
Each operation/edge is associated with a learnable (attention) weight. (The search is then reduced to finding these attention weights, i.e., each set of attention weights represents an architecture; see the mixed-op sketch after this list.)
Formulated as a bilevel optimization problem:
Outer loop (upper level): find the optimal attention weights that minimize the validation loss,
Inner loop (lower level): where the optimal model weights are obtained by minimizing the training loss.
Approximate the gradient of the attention weights with Eq. (8) in the paper (which only requires two training steps in the inner loop).
A discrete architecture is finally derived by taking the argmax of the weights over the operations on each edge.
The reported cost does not include the architecture selection cost or the retraining of the selected architecture.
Evaluated on CIFAR-10 (2.76% error rate), PTB dataset (word: 58.1 ppl) and ImageNet (26.7% error rate).
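A condensed PyTorch sketch of the continuous relaxation on a single edge, using the simpler first-order alternation (the paper's Eq. (8) adds a second-order correction):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One DARTS-style edge: a softmax(alpha)-weighted mixture of candidate ops."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Identity(),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture weights

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

model = MixedOp(channels=8)
weight_params = [p for n, p in model.named_parameters() if n != "alpha"]
w_opt = torch.optim.SGD(weight_params, lr=0.01, momentum=0.9)
a_opt = torch.optim.Adam([model.alpha], lr=3e-4)

def loss_fn(batch):
    x, y = batch
    return F.mse_loss(model(x), y)                  # stand-in for the task loss

train_batch = val_batch = (torch.randn(2, 8, 16, 16), torch.randn(2, 8, 16, 16))
for step in range(3):
    # Lower level: update model weights on the training loss.
    w_opt.zero_grad(); loss_fn(train_batch).backward(); w_opt.step()
    # Upper level: update architecture weights on the validation loss.
    a_opt.zero_grad(); loss_fn(val_batch).backward(); a_opt.step()

# Discretize: keep the op with the largest alpha on each edge.
print(model.ops[model.alpha.argmax().item()])
```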
Integrates NAS with instance awareness and searches for a distribution of architectures (i.e., each instance can have its own parameter-shared architecture).
Search space: a one-shot architecture (like ENAS) similar to MobileNetV2.
The controller is trained to be instance-aware and optimizes multiple objectives (accuracy and latency).
Achieves 48.9%, 40.2%, 35.2%, and 14.5% (+26.5% if a ~0.7% accuracy drop is acceptable) latency reduction on CIFAR-10, CIFAR-100, TinyImageNet, and ImageNet, respectively.
Previous work trains NAS on a "proxy task" (e.g., CIFAR-10) and transfers the result to ImageNet in order to alleviate the huge computational cost, but the searched architecture is not guaranteed to be optimal on the target task.
ProxylessNAS can search directly on ImageNet and removes the restriction of only searching for cells and repeating them.
Addresses the large GPU memory consumption of large one-shot architectures via path binarization (i.e., binarizing the attention weights of DARTS) so that only one path is active at a time.
Optimizes non-differentiable hardware objectives (e.g., latency) by modeling them as a continuous function and treating it as a regularization loss (a sketch follows this list).
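A small sketch of that differentiable latency regularization (hypothetical numbers; ProxylessNAS additionally binarizes the path weights, while plain softmax is used here to keep the sketch short):

```python
import torch
import torch.nn.functional as F

latency_table = torch.tensor([3.1, 5.4, 0.2])        # measured ms per candidate op
alpha = torch.zeros(3, requires_grad=True)           # architecture parameters for one edge

def expected_latency(alpha):
    # Probability-weighted latency: differentiable w.r.t. the architecture parameters.
    return (F.softmax(alpha, dim=0) * latency_table).sum()

task_loss = torch.tensor(1.0)                        # stand-in for the cross-entropy loss
lam = 0.1                                            # accuracy/latency trade-off coefficient
loss = task_loss + lam * expected_latency(alpha)
loss.backward()
print(alpha.grad)                                    # gradients push alpha toward cheaper ops
```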
As the title suggests, they propose to train, once, a network that contains several sub-networks specialized for different computational budgets, instead of training each specialized sub-network from scratch.
4% top-1 accuracy improvement on ImageNet; same accuracy but 1.5x faster than MobileNetV3 and 2.6x faster than EfficientNet in measured latency, while reducing GPU hours by many orders of magnitude; 1st place in the 3rd & 4th Low Power CV Challenges on both the classification and detection tracks.
So, the main question of this paper is: how to train this "Once-For-All" (OFA) network? Progressive shrinking!
To my understanding, progressive shrinking lets sub-networks (or "pruned" networks) share the weights of the full teacher network, which may be better than pruning the network directly (i.e., there is some "distillation" spirit in this idea); a toy sketch follows.
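A toy PyTorch sketch of that weight-sharing-plus-distillation flavor (not the paper's procedure, which also shrinks kernel size and depth and follows a fixed schedule):

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticLinear(nn.Module):
    """A linear layer whose active width can be shrunk at run time (shared weights)."""
    def __init__(self, in_dim, max_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, in_dim) * 0.01)
        self.bias = nn.Parameter(torch.zeros(max_out))

    def forward(self, x, width):
        return F.linear(x, self.weight[:width], self.bias[:width])

layer = ElasticLinear(in_dim=32, max_out=64)
head = nn.Linear(64, 10)
opt = torch.optim.SGD(list(layer.parameters()) + list(head.parameters()), lr=0.01)

def forward(x, width):
    h = F.relu(layer(x, width))
    h = F.pad(h, (0, 64 - width))                   # zero-pad so the shared head still fits
    return head(h)

x = torch.randn(16, 32)
widths = [64, 48, 32, 16]                           # progressively include smaller widths
for step in range(3):
    with torch.no_grad():
        teacher_logits = forward(x, 64)             # the full network acts as the teacher
    student_logits = forward(x, random.choice(widths))
    loss = F.mse_loss(student_logits, teacher_logits)   # distillation-style target
    opt.zero_grad(); loss.backward(); opt.step()
```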
[1] Neural Architecture Search with Reinforcement Learning (NAS). ICLR 2017.
[2] Learning Transferable Architectures for Scalable Image Recognition (NASNet). CVPR 2018.
[3] Efficient Architecture Search by Network Transformation (EAS). AAAI 2018. [Code]
[4] Progressive Neural Architecture Search (PNAS). ECCV 2018. [TF Code][PT Code]
[5] Efficient Neural Architecture Search via Parameter Sharing (ENAS). ICML 2018.
[6] Understanding and Simplifying One-Shot Architecture Search (One-Shot). ICML 2018.
[7] DARTS: Differentiable Architecture Search. ICLR 2019.
[8] InstaNAS: Instance-aware Neural Architecture Search. AAAI 2020 | ICML 2019 Workshop. [Website][Code]
[9] ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. ICLR 2019. [Website] [Poster] [Code]
[10] Once-For-All: Train One Network and Specialize It for Efficient Deployment. ICLR 2020.
For the latest state of NAS, see A Comprehensive Survey of Neural Architecture Search: Challenges and Solutions.