optuna / optuna

A hyperparameter optimization framework
https://optuna.org

Speed up `TPESampler` using approximation in standard normal related computation #5478

Open · nabenabe0928 opened 3 weeks ago

Motivation

TPESampler slows down significantly for high-dimensional objectives, so I profiled where the slowdown comes from.

According to my profiling, _get_internal_repr in TPESampler and the ndtr-related functions in _truncnorm.py were the main causes.

Since the Optuna design prefers each sampler to be stateless, it is hard to enhance _get_internal_repr (its result cannot simply be cached across calls), whereas ndtr can be enhanced.

In fact, I could solve the following problem three times faster with an approximation algorithm for the standard normal distribution:

import time

import optuna

def objective(trial: optuna.Trial) -> float:
    # Print progress every 50 trials.
    if (trial.number + 1) % 50 == 0:
        print(trial.number + 1)

    # 10-dimensional sphere function.
    return sum(trial.suggest_float(f"x{i}", -5, 5) ** 2 for i in range(10))

if __name__ == "__main__":
    optuna.logging.set_verbosity(optuna.logging.CRITICAL)
    sampler = optuna.samplers.TPESampler(seed=42)
    study = optuna.create_study(sampler=sampler)
    start = time.time()
    study.optimize(objective, n_trials=1000)
    print(time.time() - start, study.best_trial)

Description

By adding an option to use an approximation algorithm, we can speed up the TPESampler routine.
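For illustration, the option could look like the following; the keyword argument is hypothetical and does not exist in the current TPESampler API:

import optuna

# Current construction:
sampler = optuna.samplers.TPESampler(seed=42)
# With the proposed option (hypothetical keyword, shown commented out):
# sampler = optuna.samplers.TPESampler(seed=42, approximate_cdf=True)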

There is a paper that discusses the approximation quality of the standard normal CDF, cf. Approximating the Cumulative Distribution Function of the Normal Distribution.

I used the standard logistic function, i.e., Eq. (13) in the paper, because its invertibility is necessary for ndtri_exp (we could also use different approximation methods for the different functions). Note that invertibility holds for Eqs. (1), (4), (5), (7), (8), and (13). Since _ndtr itself never needs to be inverted and is already vectorized, we do not have to add a dedicated approximation for it, and could instead use a more accurate one such as Bryc (2001B) in Eq. (12) or Zelen and Severo (1964) in Eq. (2).
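For reference, the logistic approximation in Eq. (13) matches the logistic distribution's variance to the standard normal's, which yields the constant -π/√3 used as _logistic_C below and a closed-form inverse:

$$
\Phi(x) \approx \frac{1}{1 + e^{-\pi x / \sqrt{3}}}, \qquad \Phi^{-1}(p) \approx \frac{\sqrt{3}}{\pi} \log\frac{p}{1 - p}
$$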

import math

import numpy as np

# _log_sum, _log_diff, and _norm_pdf_C (= sqrt(2 * pi)) are the existing helpers
# in _truncnorm.py.

_logistic_C = -math.pi / math.sqrt(3)

def _ndtri_exp(y: np.ndarray) -> np.ndarray:
    # Invert y = log_ndtr(x) = -log(1 + exp(c * x)):  x = 1/c * log(exp(-y) - 1).
    return 1 / _logistic_C * _log_diff(-y, np.array(0.0))

def _ndtr(a: np.ndarray) -> np.ndarray:
    # Logistic approximation of the standard normal CDF.
    return 1.0 / (1.0 + np.exp(_logistic_C * a))

def _log_ndtr(a: np.ndarray) -> np.ndarray:
    # log(1 / (1 + exp(c * a))) = -logaddexp(0, c * a).
    return -_log_sum(np.zeros_like(a), _logistic_C * a)

def _log_ndtr_by_byrc(a: np.ndarray) -> np.ndarray:
    # Bryc's rational approximation of the normal upper tail, Eq. (12) in the paper.
    n2, n1, n0 = (1.0, 5.575192695, 12.77436324)
    d3, d2, d1, d0 = (_norm_pdf_C, 14.38718147, 31.53531977, 25.548726)
    a1 = np.abs(a)
    a2 = a1 * a1
    a3 = a1 * a2
    # Log of the upper-tail probability at |a|; this already equals log(ndtr(a)) for a <= 0.
    res = np.log(n2 * a2 + n1 * a1 + n0) - np.log(d3 * a3 + d2 * a2 + d1 * a1 + d0) - a2 * 0.5
    a_is_positive = a > 0
    # For a > 0, log(ndtr(a)) = log(1 - exp(res)).
    res[a_is_positive] = _log_diff(np.array(0.0), res[a_is_positive])
    return res
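As a quick sanity check, the approximations above can be compared against SciPy's exact ndtr and log_ndtr, assuming the functions above are defined in the same session:

import numpy as np
from scipy.special import log_ndtr, ndtr

x = np.linspace(-8.0, 8.0, 10_001)
# Worst-case absolute error of the logistic CDF approximation.
print(np.max(np.abs(_ndtr(x) - ndtr(x))))
# Worst-case absolute error of the Bryc-based log-CDF approximation.
print(np.max(np.abs(_log_ndtr_by_byrc(x) - log_ndtr(x))))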

Alternatives (optional)

Another option would be to add a setting that uses only the most recent K trials for the surrogate training.
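A minimal sketch of that trial-window idea, assuming a hypothetical helper (neither the name nor the integration point into TPESampler exists today):

from optuna.trial import FrozenTrial

def _keep_recent_trials(trials: list[FrozenTrial], k: int) -> list[FrozenTrial]:
    # Keep only the K most recently created trials; the Parzen estimators
    # would then be fit on this window instead of the full history.
    return sorted(trials, key=lambda t: t.number)[-k:]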