[Closed] RobertTLange closed this 2 years ago
Once we have addressed the async scheduling of jobs #8 I would love to implement population-based training for hyperparameter optimization (Jaderberg et al., 2017 - https://arxiv.org/abs/1711.09846). It appears to be the most efficient parallel + non-sequential tuning algorithm even for small population sizes (ca. 15 runs) and across multiple domains.

This will require a different pipeline compared to the standard `base_hyperopt` formulation. We will have to use a pretty rigid training script which implements the `step` and `evaluate` functions. Below you find a little mental dump of how this could work with the following setup:
The general API looks as follows:
`Step` - `Eval` - `Ready?` - If yes: `Exploit` - If params changed: `Explore`

The steps are performed asynchronously and in parallel. More details on each step:

- `Step`: Optimisation of the network given the fixed current hyperparams.
- `Eval`: Compute fitness/performance after the optimization step.
- `Ready`: A population member undergoes `explore`/`exploit` only when a fixed number of `Step` updates has been done since the last time that member was ready. E.g. this could be 10k SGD updates.
- `Exploit`: Different exploitation strategies.
- `Explore`: Different exploration strategies.

Important detail: PBT is not only a hyperparameter optimizer but also a model selection mechanism, since we also copy the weight parameters over!
The different mutations/steps/exploitation rankings themselves don't appear to be hard to implement. But we do need a smooth logging setup as well as a standardized way of reloading network checkpoints. We probably have to differentiate between torch, tf and jax network checkpoint reloading.
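The framework differentiation could be a single dispatch function along these lines. A hypothetical sketch only: the function name and `framework` strings are placeholders, and the jax branch assumes params are stored as a plain pickled pytree:

```python
def load_checkpoint(path, model, framework):
    """Reload network parameters from `path` into `model` for the given framework."""
    if framework == "torch":
        import torch
        # torch checkpoints are state dicts saved via torch.save(model.state_dict(), path)
        model.load_state_dict(torch.load(path, map_location="cpu"))
        return model
    elif framework == "tensorflow":
        # tf.keras models saved via model.save_weights(path)
        model.load_weights(path)
        return model
    elif framework == "jax":
        import pickle
        # jax "models" are pytrees of arrays; a plain pickle is one simple option
        with open(path, "rb") as f:
            return pickle.load(f)
    raise ValueError(f"Unknown framework: {framework}")
```

Note the asymmetry: torch/tf mutate a model object in place, while jax just returns a fresh pytree, so the PBT exploit step would need to handle both conventions.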