secondmind-labs / trieste

A Bayesian optimization toolbox built on TensorFlow
Apache License 2.0

GaussianProcessRegression() optimize does not work in a subprocess #645

Open ioananikova opened 1 year ago

ioananikova commented 1 year ago

Describe the bug When a pool of processes is used for executing calls (e.g. with concurrent.futures.ProcessPoolExecutor), the optimize() method of GaussianProcessRegression hangs and never finishes. More specifically, the hang occurs in evaluate_loss_of_model_parameters().

To reproduce Steps to reproduce the behaviour:

  1. Create a pool of processes
  2. Make sure a GPR model is created in a process
  3. Update the model
  4. Then try to optimize the model (it will hang here)

A minimal reproducible code example is attached to illustrate the problem (rename the extension to .py): test_concurrent_trieste.txt

Expected behaviour The optimize() method should behave as it does when run in the main process rather than a subprocess. Usually this step takes less than a second to finish.

System information

Additional context Even if the import statements are moved into the subprocess, it fails.

uri-granta commented 8 months ago

(Confirmed that this is still broken with the latest version; possibly hitting some sort of deadlock.)

uri-granta commented 8 months ago

This is somehow connected to the use of tf.function compilation. Disabling tracing with tf.config.run_functions_eagerly(True) allows the code example to run (though at the obvious expense of executing everything eagerly each time). Will investigate further.

uri-granta commented 8 months ago

It's also somehow connected to something trieste or one of its dependent libraries does:

# COMMENTING OUT EITHER import trieste OR @tf.function MAKES THIS PASS!
import concurrent.futures
import tensorflow as tf
import trieste

@tf.function
def say_hi():
    tf.print("hi")

def concurrency_test(n):
    print("I'm going to say hi!")
    say_hi()

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=1) as executor:
        executor.map(concurrency_test, [10])
uri-granta commented 8 months ago

Ok, so it looks like this is due to some state initialisation performed by TensorFlow when it is called for the first time. Replacing import trieste with tf.constant(42) or similar in the example above also hangs.

The solution is to avoid importing trieste until you're inside the subprocess:

import concurrent.futures

WORKERS = 1

def test_concurrent(num_initial_points):
    from trieste.objectives.single_objectives import Branin
    import trieste
    from trieste.models.gpflow import GaussianProcessRegression, build_gpr
    print(f'num_initial_points: {num_initial_points}')
    branin_obj = Branin.objective
    search_space = Branin.search_space
    observer = trieste.objectives.utils.mk_observer(branin_obj)

    initial_query_points = search_space.sample_halton(num_initial_points)
    initial_data = observer(initial_query_points)
    print('initial data created')

    gpflow_model = build_gpr(initial_data, search_space, likelihood_variance=1e-7)
    model = GaussianProcessRegression(gpflow_model)
    print('model created')

    model.update(initial_data)
    print('model updated')
    model.optimize(initial_data)
    print('model optimized')

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=WORKERS) as executor:
        executor.map(test_concurrent, [10])
uri-granta commented 8 months ago

I'll see whether we can document this anywhere. Does this solve your issue? (if you can remember back to October 2022!)