Frugal Optimization for Cost-related Hyperparameters

In principle, line search with some modifications.
The main contribution is the theoretical analysis based on convex optimization.
Main points
Add the following three modifications to line search (see the sketch after this list):
Restart after getting stuck in a local optimum
Adaptive step size based on the theoretical analysis
Rounding of configurations for discrete spaces
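A minimal sketch of how these three pieces could fit together, assuming a simple randomized local/line search; the function name, bounds handling, and the doubling/halving step-size rule below are my own illustrative choices, not the paper's exact algorithm:

```python
import numpy as np

def frugal_local_search(f, lb, ub, is_int, n_evals=200, seed=0):
    """Sketch of randomized line/local search with the three modifications above:
    restart, adaptive step size, and rounding. The update rule is illustrative,
    not the exact algorithm from the paper."""
    rng = np.random.default_rng(seed)
    dim = len(lb)

    def project(x):
        # Clip to the box and round integer dimensions (modification: rounding).
        x = np.clip(x, lb, ub)
        return np.where(is_int, np.round(x), x)

    step_init = 0.1 * (ub - lb)
    step_min = 1e-3 * (ub - lb)

    x = project(rng.uniform(lb, ub))
    y = f(x)
    best_x, best_y = x, y
    step = step_init.copy()

    for _ in range(n_evals):
        # Propose a move along a random unit direction (randomized line search).
        direction = rng.standard_normal(dim)
        direction /= np.linalg.norm(direction)
        cand = project(x + step * direction)
        y_cand = f(cand)
        if y_cand < best_y:
            best_x, best_y = cand, y_cand
        if y_cand < y:
            x, y = cand, y_cand
            step = step * 2.0  # success: expand the step (modification: adaptive step size)
        else:
            step = step * 0.5  # failure: shrink the step
        if np.all(step < step_min):
            # Step size shrank below the threshold, i.e. stuck in a local optimum:
            # restart from a fresh random point (modification: restart).
            x = project(rng.uniform(lb, ub))
            y = f(x)
            if y < best_y:
                best_x, best_y = x, y
            step = step_init.copy()
    return best_x, best_y


# Toy usage: minimize a quadratic with one integer dimension.
if __name__ == "__main__":
    f = lambda x: (x[0] - 3.2) ** 2 + (x[1] - 7) ** 2
    lb, ub = np.array([0.0, 0.0]), np.array([10.0, 10.0])
    print(frugal_local_search(f, lb, ub, is_int=np.array([False, True])))
```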
The problem setting does not have the concept of fidelity, but the paper assumes that the runtime of evaluating $f(x)$ changes based on the choice of $x$.
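For instance (a hypothetical setup of my own, not one of the paper's benchmarks), a cost-related hyperparameter such as the number of boosting rounds makes both the objective $f(x)$ and its evaluation cost $g(x)$ depend on $x$:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, random_state=0)

def evaluate(n_estimators: int):
    """Return (f(x), g(x)): the CV error and the wall-clock cost,
    both of which depend on the configuration x = n_estimators."""
    start = time.time()
    score = cross_val_score(
        GradientBoostingClassifier(n_estimators=n_estimators, random_state=0),
        X, y, cv=3,
    ).mean()
    return 1.0 - score, time.time() - start
```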
For the cost-aware analysis, it is important to have Condition 3.
Condition 3 states that if the costs of $x_1$ and $x_2$ satisfy $g(x^\star) < g(x_1) < g(x_2)$, then $f(x^\star) < f(x_1) < f(x_2)$ also holds.
Note that $x^\star$ is a local optimum.
Intuitively speaking, if the model already achieves good performance at cost $g(x^\star)$, longer training does not improve it; with more cost we would rather end up overfitting.
Although this does not always hold in practice, they cannot provide the theoretical analysis without this condition.
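Written out as a display (a paraphrase of how Condition 3 is summarized above; the paper's exact statement may differ):

```latex
% Condition 3 (paraphrased): beyond the cost of the local optimum x*,
% a larger cost implies a larger (worse) objective value.
\[
  g(x^\star) < g(x_1) < g(x_2)
  \;\Longrightarrow\;
  f(x^\star) < f(x_1) < f(x_2)
\]
```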
Experiments
They used the following methods:
Random search
BOHB
GPEI (GP with expected improvement)
GPEIPS (GP with expected improvement per second)
SMAC
Their methods
Their benchmarks are self-made.
The visualizations are:
Performance over time
Dataset-wise final performance distribution of each method
Dataset-wise time-to-reach-best distribution of each method
==> I am not convinced by the distribution plots.