A model-agnostic (it only assumes the model is trained with gradient descent) meta-learning algorithm that aims to find a good initialization for the model such that it can be fine-tuned quickly on new tasks.
Shows SOTA results on few-shot image classification and regression, and fast fine-tuning for policy-gradient RL.
How does it work?
Sample a batch of tasks (each with a few training examples and "virtual" test data; the "virtual" test data is also constructed from the training data).
For each task_i:
Compute the gradient of L(θ) on the training data and update the model: θ → θ'_i.
Compute L(θ'_i) on the virtual test data.
Compute the gradient of Σ_i L(θ'_i) w.r.t. θ and update the model θ (see the sketch below).
The loss can be cross-entropy for classification, MSE for regression, and the (negative) expected reward for RL.
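Below is a minimal second-order MAML sketch in PyTorch, assuming a hypothetical toy setup (the 1-D linear-regression tasks, the `sample_task` helper, and all hyperparameters are illustrative, not from the paper):

```python
import torch

def sample_task(n=10):
    # Hypothetical toy task: 1-D regression y = a*x with a random slope a
    # per task; the "virtual" test split is drawn the same way.
    a = torch.randn(1)
    x_tr, x_te = torch.randn(n, 1), torch.randn(n, 1)
    return x_tr, a * x_tr, x_te, a * x_te

w = torch.zeros(1, requires_grad=True)  # shared initialization θ
b = torch.zeros(1, requires_grad=True)
inner_lr, meta_lr = 1e-2, 1e-3

for step in range(1000):
    meta_loss = 0.0
    for _ in range(4):                  # batch of sampled tasks
        x_tr, y_tr, x_te, y_te = sample_task()
        # Inner step: θ -> θ'_i; create_graph=True keeps the graph so the
        # outer update can differentiate through this step (2nd derivatives).
        loss_tr = ((w * x_tr + b - y_tr) ** 2).mean()
        gw, gb = torch.autograd.grad(loss_tr, (w, b), create_graph=True)
        w_i, b_i = w - inner_lr * gw, b - inner_lr * gb
        # Evaluate L(θ'_i) on the virtual test split.
        meta_loss = meta_loss + ((w_i * x_te + b_i - y_te) ** 2).mean()
    # Outer step: gradient of Σ_i L(θ'_i) w.r.t. the initialization θ.
    gw, gb = torch.autograd.grad(meta_loss, (w, b))
    with torch.no_grad():
        w -= meta_lr * gw
        b -= meta_lr * gb
```

The `create_graph=True` flag is what makes this second-order: the outer gradient differentiates through the inner update.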
FOMAML
They also tried ignoring the second derivatives that flow through θ'_i for each task_i, updating θ directly with the gradient of Σ_i L(θ'_i) taken at θ'_i (no separate train/test split is needed in the original task setup).
This is denoted as First-Order MAML (FOMAML).
It shows performance comparable to second-order MAML while saving computation.
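In code, the first-order approximation is a small change to the sketch above (again a hypothetical illustration, reusing `sample_task`, `w`, `b`, and the learning rates from the MAML sketch):

```python
# FOMAML: differentiate L(θ'_i) w.r.t. θ'_i only and apply that gradient to
# θ, skipping the Jacobian dθ'_i/dθ.
for step in range(1000):
    meta_grad_w, meta_grad_b = 0.0, 0.0
    for _ in range(4):
        x_tr, y_tr, x_te, y_te = sample_task()
        loss_tr = ((w * x_tr + b - y_tr) ** 2).mean()
        gw, gb = torch.autograd.grad(loss_tr, (w, b))        # no create_graph
        w_i = (w - inner_lr * gw).detach().requires_grad_()  # θ'_i, graph cut
        b_i = (b - inner_lr * gb).detach().requires_grad_()
        loss_te = ((w_i * x_te + b_i - y_te) ** 2).mean()
        gw_i, gb_i = torch.autograd.grad(loss_te, (w_i, b_i))
        meta_grad_w, meta_grad_b = meta_grad_w + gw_i, meta_grad_b + gb_i
    with torch.no_grad():            # apply the gradients at θ'_i directly to θ
        w -= meta_lr * meta_grad_w
        b -= meta_lr * meta_grad_b
```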
Introduces Reptile, a new first-order meta-learning algorithm in the spirit of FOMAML, which works by repeatedly sampling a task, training on it, and moving the initialization towards the trained weights on that task.
Really worth reading, especially its analysis of SGD and MAML!
How does it work?
The update is θ ← θ + ε·(U^k_T(θ) − θ), where U^k_T(θ) denotes taking k gradient updates on the sampled task T, and ε is the outer step size (learning rate).
We can also update in a batch version over n sampled tasks: θ ← θ + ε·(1/n)·Σ_{i=1}^{n} (U^k_{T_i}(θ) − θ).
If we take only k = 1 update, this is equivalent to SGD on the expected loss over tasks.
If we take k > 1 updates, it is not: the expected update then involves second- and higher-order derivatives of the loss, so it converges to a different solution.
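A minimal Reptile sketch under the same hypothetical toy setup as above (reusing `sample_task`, `w`, `b`, and `inner_lr`; the outer step size `eps` and the reuse of one minibatch per inner step are illustrative choices):

```python
k, eps = 5, 0.1                        # eps is the outer step size ε

for step in range(1000):
    x_tr, y_tr, _, _ = sample_task()   # Reptile needs no test split
    w_i, b_i = w.clone(), b.clone()
    for _ in range(k):                 # U^k_T(θ): k inner SGD steps
        loss = ((w_i * x_tr + b_i - y_tr) ** 2).mean()
        gw, gb = torch.autograd.grad(loss, (w_i, b_i))
        w_i, b_i = w_i - inner_lr * gw, b_i - inner_lr * gb
    with torch.no_grad():              # θ <- θ + ε (U^k_T(θ) - θ)
        w += eps * (w_i - w)
        b += eps * (b_i - b)
```

Note there is no test split and no second derivative anywhere; the meta-update is just a move in weight space toward the adapted weights.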
An experiment on 5-shot 5-way Omniglot compares different inner-loop gradient combinations (see the figure in the paper).
Why does it work?
Through a Taylor expansion analysis, both MAML and Reptile contain the same two leading-order terms:
First: minimizing the expected loss over tasks (the AvgGrad term; joint training on different tasks).
Second: maximizing within-task generalization (the AvgGradInner term), i.e., maximizing the inner product between gradients of different minibatches from the same task. If gradients from different batches have a positive inner product, then taking a gradient step on one batch improves performance on the other batch.
The result of the Taylor expansion for the SGD gradient at inner step i ∈ [1, k] and for MAML, with α the inner-loop learning rate:
E[g_SGD,i] = AvgGrad − α·(i−1)·AvgGradInner + O(α²)
E[g_MAML] = AvgGrad − 2α·(k−1)·AvgGradInner + O(α²)
This explains why k = 2 in the above experiment is still insufficient: with small k, the inner-product term (AvgGradInner) carries little weight relative to the expected-loss term (AvgGrad).
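To make the weighting concrete, summing the per-step expansion over i = 1..k gives the expected Reptile gradient (a quick derivation from the expansion above, not an equation quoted from the paper):
E[g_Reptile] = Σ_{i=1}^{k} E[g_SGD,i] = k·AvgGrad − α·(k(k−1)/2)·AvgGradInner + O(α²)
So the weight on AvgGradInner relative to AvgGrad grows as α(k−1)/2: k = 2 gives only α/2, while k = 5 already gives 2α.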
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML)
On First-Order Meta-Learning Algorithms (Reptile)
Probabilistic Model-Agnostic Meta-Learning
Bayesian Model-Agnostic Meta-Learning
Meta-Learning with Latent Embedding Optimization (LEO)
How to Train Your MAML (MAML++)