ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Tune] Slow tasks halt tuning #48121

Open fabio-ciani opened 2 weeks ago

fabio-ciani commented 2 weeks ago

Description

Bug

If the score computation associated with a trial takes too long during hyperparameter tuning, the patience of BayesOptSearch saturates because of duplicated configurations.

This is likely because other tasks are launched concurrently to mask the slowness of the original task and complete its uncommitted work.

For example, see the attached code snippet. The procedure should sample 10 random points to initialize the search and then take another 10 optimization steps. However, the run stops at the 11th iteration, meaning that 9 points are missing and will never be sampled.
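Roughly, the failure mode can be pictured with a toy model in plain Python (a simplified sketch, not Ray's actual implementation): a searcher that gives up after too many consecutive duplicated suggestions.

```python
def run_search(suggestions, patience):
    """Toy searcher: stops after `patience` consecutive duplicates.

    A simplified model of the saturation behavior, not Ray's code.
    """
    seen = set()
    completed = []
    repeats = 0
    for cfg in suggestions:
        if cfg in seen:
            repeats += 1
            if repeats > patience:
                break  # patience saturated: the search halts early
        else:
            repeats = 0
            seen.add(cfg)
            completed.append(cfg)
    return completed

# A slow trial makes the optimizer re-suggest the same point while
# the corresponding result is still uncommitted:
stream = [0.1, 0.2, 0.3, 0.3, 0.3, 0.3, 0.4, 0.5]
print(len(run_search(stream, patience=2)))   # halts early: 3 points sampled
print(len(run_search(stream, patience=10)))  # survives: all 5 unique points
```

With a small patience, the run of duplicates caused by the stalled trial exhausts the budget and later points are never sampled, which mirrors the behavior of the script below.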

Solution

Apart from more sensible fixes, the issue can be avoided by increasing the patience argument of the optimizer.

Related issues

#13234, #28063.

System info

Hardware

Operating System: Ubuntu 24.04 LTS Kernel: Linux 6.8.0-39-generic Architecture: x86-64

Software

Python 3.12.3

bayesian-optimization==1.5.1 ray==2.37.0

Example script

import math
import time

from ray import tune
from ray.tune.search.bayesopt import BayesOptSearch

def objective(config):
    time.sleep(10.0)  # Simulate a slow task.
    x = config["x"]
    return {"score": x ** 3 * math.log(x) - x ** 2}

search_space = {"x": tune.uniform(0.0, 2.0)}
algo = BayesOptSearch(random_search_steps=10)  # Set `patience` to a high value here to suppress the bug.
tuner = tune.Tuner(
    objective,
    tune_config=tune.TuneConfig(num_samples=20, metric="score", mode="min", search_alg=algo),
    param_space=search_space
)

results = tuner.fit()
print(results.get_best_result().config)

Severity

Medium: It is a significant difficulty but I can work around it.

fabio-ciani commented 2 weeks ago

As a follow-up note: it is advisable to limit concurrency after the initialization phase of the Tuner. Overriding this option can be crucial, in particular for inherently sequential methods like Bayesian optimization. (See #13962.)

import math
import time

from ray import tune
from ray.tune.search import ConcurrencyLimiter
from ray.tune.search.bayesopt import BayesOptSearch

def objective(config):
    time.sleep(10.0)
    x = config["x"]
    return {"score": x ** 3 * math.log(x) - x ** 2}

# Exploration.
search_space = {"x": tune.uniform(0.0, 2.0)}
algo = BayesOptSearch(random_search_steps=10, patience=1000)
tuner = tune.Tuner(
    objective,
    tune_config=tune.TuneConfig(num_samples=10, metric="score", mode="min", search_alg=algo),
    param_space=search_space
)
tuner.fit()

# Exploitation.
tuner = tune.Tuner(
    objective,
    tune_config=tune.TuneConfig(num_samples=10, metric="score", mode="min", search_alg=ConcurrencyLimiter(algo, max_concurrent=1)),
    param_space=search_space
)
results = tuner.fit()
print(results.get_best_result().config)

This is an alternative solution that feels more natural and better tailored to the use case.
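To see why limiting concurrency matters for an inherently sequential optimizer, here is a toy model in plain Python (an illustrative sketch, not Ray's or bayesian-optimization's code): each proposal depends only on committed observations, so running trials concurrently forces the acquisition step to keep returning the same point.

```python
def suggest_batches(max_concurrent, steps=6):
    """Toy sequential optimizer (illustrative sketch, not Ray's code).

    A proposal depends only on results observed so far; while trials
    are in flight, the model is unchanged and proposes the same point.
    """
    observed = []
    proposals = []
    for _ in range(steps):
        batch = []
        for _ in range(max_concurrent):
            # Stand-in for the argmax of an acquisition function:
            # deterministic in the observed data.
            batch.append(len(observed))
        proposals.extend(batch)
        observed.extend(batch)  # in-flight trials finish and commit
    return proposals

def duplicates(xs):
    return len(xs) - len(set(xs))

print(duplicates(suggest_batches(max_concurrent=1)))  # 0: every proposal is new
print(duplicates(suggest_batches(max_concurrent=4)))  # 18 of 24 proposals repeat
```

With `max_concurrent=1`, every result is committed before the next suggestion, so no duplicates arise and the searcher's patience is never consumed.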

jcotant1 commented 3 days ago

Hey @hongpeng-guo @justinvyu just following up on this one, any insight here? Thx