ray-project / tune-sklearn

A drop-in replacement for Scikit-Learn’s GridSearchCV / RandomizedSearchCV -- but with cutting edge hyperparameter tuning techniques.
https://docs.ray.io/en/master/tune/api_docs/sklearn.html
Apache License 2.0

Trouble running TuneSearchCV with hydra and skorch #202

Closed akzaidi closed 3 years ago

akzaidi commented 3 years ago

Hi,

Thanks for all the hard work on this library and the broader ray ecosystem!

I have been trying to add tune-sklearn to a generic tuning class in a script, but I'm hitting the following error when I call TuneSearchCV. The scripts use hydra for configuration management. Here is the relevant invocation, which runs TuneSearchCV with Bayesian optimization on a skorch model:

python ddm_trainer.py model=torch model.sweep.run=True model.sweep.search_algorithm=bayesian
# errors:
datamodeler  : INFO     Building model...
[2021-04-19 09:31:39,523][datamodeler][INFO] - Building model...
datamodeler  : INFO     Sweeping with parameters: {'lr': [0.01, 0.02], 'module__num_units': [10, 50]}
[2021-04-19 09:31:39,524][datamodeler][INFO] - Sweeping with parameters: {'lr': [0.01, 0.02], 'module__num_units': [10, 50]}
/Users/alizaidi-msft/miniconda3/envs/ddm/lib/python3.7/site-packages/tune_sklearn/tune_basesearch.py:429: UserWarning: early_stopping is enabled but max_iters = 1. To enable partial training, set max_iters > 1.
  category=UserWarning)
[2021-04-19 09:31:41,756][tune_sklearn.tune_basesearch][INFO] - TIP: Hiding process output by default. To show process output, set verbose=2.
[2021-04-19 09:31:41,875][ray.tune.trial_runner][WARNING] - Trial Runner checkpointing failed: can't pickle dict_values objects
[2021-04-19 09:31:44,629][ray.tune.trial_runner][ERROR] - Trial _Trainable_b8f96ba4: Error processing event.
Traceback (most recent call last):
  File "ddm_trainer.py", line 74, in main
    scoring_func=cfg["model"]["sweep"]["scoring_func"],
  File "/Users/alizaidi-msft/Documents/bonsai/datadrivenmodel/torch_models.py", line 183, in sweep
    search.fit(X, y)
  File "/Users/alizaidi-msft/miniconda3/envs/ddm/lib/python3.7/site-packages/tune_sklearn/tune_basesearch.py", line 664, in fit
    result = self._fit(X, y, groups, **fit_params)
  File "/Users/alizaidi-msft/miniconda3/envs/ddm/lib/python3.7/site-packages/tune_sklearn/tune_basesearch.py", line 565, in _fit
    analysis = self._tune_run(config, resources_per_trial)
  File "/Users/alizaidi-msft/miniconda3/envs/ddm/lib/python3.7/site-packages/tune_sklearn/tune_search.py", line 715, in _tune_run
    analysis = tune.run(trainable, **run_args)
  File "/Users/alizaidi-msft/miniconda3/envs/ddm/lib/python3.7/site-packages/ray/tune/tune.py", line 421, in run
    runner.step()
  File "/Users/alizaidi-msft/miniconda3/envs/ddm/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 402, in step
    self._process_events(timeout=timeout)  # blocking
  File "/Users/alizaidi-msft/miniconda3/envs/ddm/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 560, in _process_events
    self._process_trial(trial)
  File "/Users/alizaidi-msft/miniconda3/envs/ddm/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 586, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/Users/alizaidi-msft/miniconda3/envs/ddm/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 609, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/Users/alizaidi-msft/miniconda3/envs/ddm/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/Users/alizaidi-msft/miniconda3/envs/ddm/lib/python3.7/site-packages/ray/worker.py", line 1456, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ModuleNotFoundError): ray::_Trainable.train_buffered() (pid=5485, ip=10.0.0.29)
  File "python/ray/_raylet.pyx", line 439, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 442, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 459, in ray._raylet.execute_task
  File "/Users/alizaidi-msft/miniconda3/envs/ddm/lib/python3.7/site-packages/ray/serialization.py", line 245, in deserialize_objects
    self._deserialize_object(data, metadata, object_ref))
  File "/Users/alizaidi-msft/miniconda3/envs/ddm/lib/python3.7/site-packages/ray/serialization.py", line 192, in _deserialize_object
    return self._deserialize_msgpack_data(data, metadata_fields)
  File "/Users/alizaidi-msft/miniconda3/envs/ddm/lib/python3.7/site-packages/ray/serialization.py", line 170, in _deserialize_msgpack_data
    python_objects = self._deserialize_pickle5_data(pickle5_data)
  File "/Users/alizaidi-msft/miniconda3/envs/ddm/lib/python3.7/site-packages/ray/serialization.py", line 160, in _deserialize_pickle5_data
    obj = pickle.loads(in_band)
ModuleNotFoundError: No module named 'torch_models'

It seems that ray serializes the objects from my script but then cannot find the module they were defined in (torch_models) when it deserializes them on the worker. Intriguingly, the same invocation works in the test suite, where the tests run from a subdirectory (they are commented out for the CI pipeline, but uncommenting and running them works 🤷).
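The traceback ends in `pickle.loads` raising `ModuleNotFoundError`, which is consistent with how pickle handles classes: it stores a reference (module name plus class name), not the class body, so the loading process must be able to import that module. Below is a minimal sketch of that behavior using only the standard library. The module name `toy_torch_models` and the class `Net` are hypothetical stand-ins for `torch_models`, and this assumes Ray's serializer behaves like stdlib pickle for importable modules; it is an illustration, not a reproduction of the original project.

```python
import importlib
import os
import pickle
import sys
import tempfile

MODULE_NAME = "toy_torch_models"  # hypothetical stand-in for torch_models


def pickle_roundtrip_without_module():
    """Pickle an instance, make its module unimportable, then try to unpickle."""
    with tempfile.TemporaryDirectory() as module_dir:
        # Write a tiny module to disk so the class lives in a real, importable module.
        with open(os.path.join(module_dir, MODULE_NAME + ".py"), "w") as f:
            f.write("class Net:\n    def __init__(self):\n        self.num_units = 10\n")

        sys.path.insert(0, module_dir)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module(MODULE_NAME)
            payload = pickle.dumps(module.Net())
        finally:
            # Simulate a worker process that cannot see the module's directory.
            sys.path.remove(module_dir)
            sys.modules.pop(MODULE_NAME, None)

        # The payload carries the module name, not the class definition.
        stores_reference = MODULE_NAME.encode() in payload

        try:
            pickle.loads(payload)
        except ModuleNotFoundError as exc:
            return stores_reference, str(exc)
    return stores_reference, None


if __name__ == "__main__":
    print(pickle_roundtrip_without_module())
```

If this is what happens here, the question becomes why the worker cannot import `torch_models`, rather than whether serialization itself is broken.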

Should I restructure the scripts in a way to make ray happier when running TuneSearchCV?

akzaidi commented 3 years ago

Resolved! The issue is how hydra handles output directories.
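For anyone landing here later, a hedged sketch of the general mechanism: Hydra (at least in the versions current at the time of this issue) switches the process working directory to a per-run output directory by default, so any import that depends on the current directory stops resolving. The example below simulates this with the standard library only. The module name `toy_models_demo`, the `outputs/run` layout, and the helper names are all hypothetical; in a real Hydra app the original directory is available from `hydra.utils.get_original_cwd()`, and adding it to `sys.path` (or installing the project as a package) is one possible workaround, not necessarily the fix applied in this issue.

```python
import importlib
import os
import sys
import tempfile

MODULE_NAME = "toy_models_demo"  # hypothetical module that lives next to the script


def can_import(name):
    """Return True if `name` can be imported from scratch right now."""
    sys.modules.pop(name, None)
    importlib.invalidate_caches()
    try:
        importlib.import_module(name)
        return True
    except ModuleNotFoundError:
        return False


def demo():
    original_cwd = os.getcwd()
    original_path = list(sys.path)
    with tempfile.TemporaryDirectory() as project_dir:
        # Hypothetical per-run output directory, similar in spirit to Hydra's.
        output_dir = os.path.join(project_dir, "outputs", "run")
        os.makedirs(output_dir)
        with open(os.path.join(project_dir, MODULE_NAME + ".py"), "w") as f:
            f.write("VALUE = 1\n")
        try:
            os.chdir(project_dir)
            sys.path.insert(0, "")  # "" means "the current working directory"
            before_chdir = can_import(MODULE_NAME)

            os.chdir(output_dir)  # what a working-directory switch does
            after_chdir = can_import(MODULE_NAME)

            sys.path.insert(0, project_dir)  # workaround: absolute project path
            after_fix = can_import(MODULE_NAME)
        finally:
            # Restore global state before the temporary directory is removed.
            os.chdir(original_cwd)
            sys.path[:] = original_path
            sys.modules.pop(MODULE_NAME, None)
    return before_chdir, after_chdir, after_fix


if __name__ == "__main__":
    print(demo())
```

The same reasoning would explain why a run started from a different directory (such as the test suite) can behave differently: what matters is which directories are on the import path of the process that unpickles the objects.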