optuna / optuna

A hyperparameter optimization framework
https://optuna.org
MIT License

Reintroduce Joblib-based multiprocessing backend option #4539

Open j-adamczyk opened 1 year ago

j-adamczyk commented 1 year ago

Motivation

In the current version of Optuna, there is no way to easily perform multiprocessing inside a single Python script. Running multiple terminals is impossible in automated cloud environments, and even where it is possible, it's plainly bad design. Scikit-learn and related APIs support n_jobs exactly for this purpose: if I set n_jobs=-1, I utilize all my CPU cores. Of course, using threads by default is a reasonable design for Optuna, but there should be an option to switch this behavior to use processes instead.

This would be useful for CPU-bound jobs where a single job cannot be parallelized easily. The most common use case is SVM in Scikit-learn, which is single-threaded but requires extensive hyperparameter tuning, which can be done in parallel. Another use case is training multiple neural networks on the same GPU from multiple CPU processes, e.g. Graph Neural Networks (GNNs).

An additional advantage would be that n_jobs in OptunaSearchCV would have the same meaning as in Scikit-learn, which it integrates with.

Description

The problem lies in the _optimize function, here:

with ThreadPoolExecutor(max_workers=n_jobs) as executor:
    for n_submitted_trials in itertools.count():
        if study._stop_flag:
...

Since a thread-based executor is hardcoded here, there is no way to specify anything else. Even setting the Joblib backend, which used to work, has no effect here.

However, if the executor could be specified, the user could use any backend supported by Joblib: regular Python multithreading or multiprocessing, Loky (efficient multiprocessing, the default in Scikit-learn), or anything else. I suggest using Joblib, as it is the easiest and most flexible option, and arguably the most popular.

There would be 2 changes required:

  1. Add a parallel_backend option to _optimize() and the functions that call it, specifying the Joblib backend to use.
  2. Use joblib.Parallel() instead of ThreadPoolExecutor in _optimize() for the multiple-jobs case.

Note that this does not require any changes to the RDB backend, as this is exactly equivalent to running Optuna in separate terminals.

Alternatives (optional)

Currently, the only alternative is to manually launch multiple parallel study.optimize() workers via Joblib (taken from here):

joblib.Parallel(n_jobs)(
    joblib.delayed(optimize_study)(
        study.study_name, storage_string, objective, n_trials=25
    )
    for i in range(n_jobs)
)

However, this requires a manual wrapper around core, important functionality. I have used this approach in multiple projects, and copy-pasting it so many times makes me feel it should just be built in.

Additional context (optional)

No response

c-bata commented 1 year ago

In the current version of Optuna, there is no way to easily perform multiprocessing inside a single Python script.

Since the concurrent.futures module provides a high-level API, I think it would basically work if you changed it to ProcessPoolExecutor like the following.

import optuna
# from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import ProcessPoolExecutor as ThreadPoolExecutor

def objective(trial):
    x = trial.suggest_float("x", -100, 100)
    y = trial.suggest_float("y", -100, 100)
    return x**2 + y

def main():
    study = optuna.create_study(storage="sqlite:///db.sqlite3")   # Please use RDBStorage, JournalStorage or DaskStorage.
    with ThreadPoolExecutor(max_workers=5) as pool:
        for i in range(5):
            pool.submit(study.optimize, objective, n_trials=10)
    print(f"Best params: {study.best_params}")

if __name__ == '__main__':
    main()

Using concurrent.futures makes it unnecessary for users to install joblib as an additional dependency and simplifies Optuna's source code. What do you think?

j-adamczyk commented 1 year ago

@c-bata that's a nice solution. I think even from concurrent.futures import ProcessPoolExecutor as ThreadPoolExecutor would suffice, as long as I import this before Optuna. However, this has 3 downsides:

  1. This is a totally non-obvious hack, relying on import ordering and Optuna internals.
  2. This still means that n_jobs has a different meaning in Scikit-learn and in its Optuna integration.
  3. I would still need to modify code outside Optuna, when the loop could be parallelized internally.

But this would make for an easy change to Optuna's behavior. Simply using either ProcessPoolExecutor or ThreadPoolExecutor in _optimize() would suffice for many use cases. However, arbitrary executors should also be supported, since they may offer major advantages. Most notably, the Loky backend and executor is more robust than plain multiprocessing, and faster e.g. when passing Numpy arrays. Using Joblib would give all three options the same API, but it is not strictly necessary.

However, Scikit-learn already depends on Joblib, so a large chunk of Optuna users depend on it anyway. Optuna itself also used to depend on Joblib, as older issues reference it. It is a relatively self-contained and lightweight dependency.

I see 3 options:

  1. Still require the user to do manual loops or import hacking - this issue is about changing that.
  2. Add an option to choose Python processes or threads, with an argument to _optimize() and other functions to switch between the two. Does not add any dependencies and is simple, but is less robust and slower (at least when using Numpy) than option 3.
  3. Add a Joblib dependency and use it in _optimize(), with threads as the default backend but with processes and Loky as options. This is the most robust solution and still easy to implement, but adds a dependency (though a small and common one).
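Option 2 can be sketched with the standard library alone. For illustration only: run_parallel and use_processes are hypothetical names, not Optuna's API.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import math

def run_parallel(func, args, n_jobs=2, use_processes=False):
    # Pick the executor class based on the hypothetical flag; the real
    # _optimize() would make the same choice around its trial loop.
    executor_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with executor_cls(max_workers=n_jobs) as pool:
        return list(pool.map(func, args))

if __name__ == "__main__":
    # Thread-based by default; use_processes=True switches to real processes
    # (the function must then be picklable, i.e. defined at module level).
    print(run_parallel(math.sqrt, [1.0, 4.0, 9.0]))  # [1.0, 2.0, 3.0]
```

Because both executor classes share the concurrent.futures interface, the switch is a one-line decision rather than a second code path.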

okaikov commented 1 year ago

Hi, any updates on this feature request?

j-adamczyk commented 1 year ago

@okaikov unfortunately not, as far as I know. I see this as a major problem with Optuna, and I am currently researching other frameworks. Using JoblibStudy from this PR (which sadly also got closed, meaning the problem is still there) may work for your use case, but it requires copy-pasting that code.

FlorinAndrei commented 9 months ago

makes it unnecessary for users to install joblib as an additional dependency

This is not a serious problem. joblib is already used by many important libraries. It gets pulled as a dependency as soon as you install something as widespread as scikit-learn.

Not having true multiprocessing in Optuna is a significant limitation at this point.

FlorinAndrei commented 9 months ago

One problem that the joblib workaround does not solve is that, when you have multiple separate processes, each running study.optimize(), there is no shared in-memory storage. You have to use shared external storage, which can be slow. I would very much like to run an efficient multiprocessing search with Optuna using the in-memory storage, but right now I can't.

cgr71ii commented 5 months ago

One problem that the joblib workaround does not solve is that, when you have multiple separate processes, each running study.optimize(), there is no shared in-memory storage.

This leads to the problem that, even using the alternative provided by @j-adamczyk, you may be running N processes with the same set of hyperparameters... Does anyone have a workaround for the shared memory to avoid sampling the same set of hyperparameters N times? If I'm not mistaken, this also affects your configured pruning strategy.
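One common cause of duplicated suggestions is every worker seeding its sampler identically. A minimal standard-library illustration of the effect (suggest is a hypothetical stand-in for a sampler, not Optuna code):

```python
import random

def suggest(seed, n):
    # Stand-in for a sampler's suggest step: deterministic given its seed.
    rng = random.Random(seed)
    return [rng.uniform(-100, 100) for _ in range(n)]

# Every worker seeded identically -> the same "hyperparameters" N times.
same = [suggest(42, 3) for worker in range(4)]
assert all(s == same[0] for s in same)

# Distinct per-worker seeds -> distinct suggestions.
distinct = [suggest(worker, 3) for worker in range(4)]
assert len({tuple(s) for s in distinct}) == 4
```

In Optuna, constructing each process's study with a differently seeded sampler (samplers such as TPESampler accept a seed parameter) is one commonly suggested way to avoid this; whether it fully resolves interactions with pruning is a separate question.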

ohm-car commented 3 months ago

True multiprocessing would indeed be very helpful. I would really love to run multiple trials simultaneously on a server with 4+ GPUs and get my results faster. Currently I just parallelize the model across however many GPUs are available, but running multiple trials in parallel would be helpful.