Neverending ray tuning #153

Closed DanielAtKrypton closed 3 years ago

DanielAtKrypton commented 3 years ago

Problem description

I am setting up a test framework for Ray Tune, but I got stuck while trying to tune the learning rate hyperparameter of a pipelined network.

The test code can be found here.

While debugging the test, I noticed that the tuning spawns many threads, as can be seen in the call stack to the left and below:


Although I have already installed gpustat by running pip install gpustat, there is still a message warning me to install it. These threads stay open for hours and there is no other feedback in the terminal.

Is there anything I am missing to make the learning rate hyperparameter tuning work smoothly here?

Environment information

VS Code


Python dependencies:

requirements lock

Yard1 commented 3 years ago

What happens if you set n_jobs=1 in TuneGridSearchCV?

DanielAtKrypton commented 3 years ago

> What happens if you set n_jobs=1 in TuneGridSearchCV?

The same behaviour as a result...

Yard1 commented 3 years ago

Can you try cv=2?

DanielAtKrypton commented 3 years ago

> Can you try cv=2?

Sure. I still got the same behavior with n_jobs=1 and cv=2:

best_score, best_params = tsp.tune_grid_search(

DanielAtKrypton commented 3 years ago

My pip list within the virtual environment:

Yard1 commented 3 years ago

Can you try updating tune-sklearn to the version on GitHub?

pip install -U git+https://github.com/ray-project/tune-sklearn.git

Also, please make sure that your Ray version is up to date.
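A quick way to confirm which versions the active interpreter actually sees is to query the installed package metadata. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8; on older interpreters the `importlib_metadata` backport would be needed, which is an assumption about the setup). The helper names `installed_version` and `report` are hypothetical.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages=("ray", "tune-sklearn", "gpustat")):
    """Map each package name to the version visible to this interpreter."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    # sys.executable shows which environment these answers apply to.
    print(sys.executable)
    for name, version in report().items():
        print(f"{name}: {version or 'not installed'}")
```

If a package such as gpustat shows up as not installed here even after running pip install, pip may have targeted a different environment than the one running the tests.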

DanielAtKrypton commented 3 years ago

> Can you try updating tune-sklearn to the version on GitHub?
>
> pip install -U git+https://github.com/ray-project/tune-sklearn.git
>
> Also, please make sure that your Ray version is up to date.

After I updated with the command above, it went to version 0.0.8. Now the test crashes with the following info:

Windows fatal exception: stack overflow

Thread 0x00003438 (most recent call first):
  File "C:\Python37\lib\threading.py", line 300 in wait
  File "C:\Python37\lib\threading.py", line 552 in wait
  File "c:\Users\Daniel\.vscode\extensions\ms-python.python-2020.11.371526539\pythonFiles\lib\python\debugpy\_vendored\pydevd\pydevd.py", line 232 in _on_run
  File "c:\Users\Daniel\.vscode\extensions\ms-python.python-2020.11.371526539\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_daemon_thread.py", line 46 in run
  File "C:\Python37\lib\threading.py", line 926 in _bootstrap_inner
  File "C:\Python37\lib\threading.py", line 890 in _bootstrap

Thread 0x00004a5c (most recent call first):
  File "C:\Python37\lib\threading.py", line 300 in wait
  File "C:\Python37\lib\threading.py", line 552 in wait
  File "c:\Users\Daniel\.vscode\extensions\ms-python.python-2020.11.371526539\pythonFiles\lib\python\debugpy\_vendored\pydevd\pydevd.py", line 186 in _on_run
  File "c:\Users\Daniel\.vscode\extensions\ms-python.python-2020.11.371526539\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_daemon_thread.py", line 46 in run
  File "C:\Python37\lib\threading.py", line 926 in _bootstrap_inner
  File "C:\Python37\lib\threading.py", line 890 in _bootstrap

Thread 0x00005e20 (most recent call first):
  File "C:\Python37\lib\threading.py", line 296 in wait
  File "C:\Python37\lib\threading.py", line 552 in wait
  File "c:\Users\Daniel\.vscode\extensions\ms-python.python-2020.11.371526539\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_timeout.py", line 43 in _on_run
  File "c:\Users\Daniel\.vscode\extensions\ms-python.python-2020.11.371526539\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_daemon_thread.py", line 46 in run
  File "C:\Python37\lib\threading.py", line 926 in _bootstrap_inner
  File "C:\Python37\lib\threading.py", line 890 in _bootstrap

Thread 0x000067e4 (most recent call first):
  File "c:\Users\Daniel\.vscode\extensions\ms-python.python-2020.11.371526539\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_comm.py", line 210 in _read_line
  File "c:\Users\Daniel\.vscode\extensions\ms-python.python-2020.11.371526539\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_comm.py", line 228 in _on_run
  File "c:\Users\Daniel\.vscode\extensions\ms-python.python-2020.11.371526539\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_daemon_thread.py", line 46 in run
  File "C:\Python37\lib\threading.py", line 926 in _bootstrap_inner
  File "C:\Python37\lib\threading.py", line 890 in _bootstrap

Thread 0x0000337c (most recent call first):
  File "C:\Python37\lib\threading.py", line 300 in wait
  File "C:\Python37\lib\queue.py", line 179 in get
  File "c:\Users\Daniel\.vscode\extensions\ms-python.python-2020.11.371526539\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_comm.py", line 339 in _on_run
  File "c:\Users\Daniel\.vscode\extensions\ms-python.python-2020.11.371526539\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_daemon_thread.py", line 46 in run
  File "C:\Python37\lib\threading.py", line 926 in _bootstrap_inner
  File "C:\Python37\lib\threading.py", line 890 in _bootstrap

Current thread 0x00005d54 (most recent call first):
  File "c:\Users\Daniel\.vscode\extensions\ms-python.python-2020.11.371526539\pythonFiles\lib\python\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_trace_dispatch_regular.py", line 364 in __call__
  File "c:\Users\Daniel\Workspaces\Python\time_series_predictor\.env\lib\site-packages\sklearn\utils\_pprint.py", line 335 in _safe_repr
  File "c:\Users\Daniel\Workspaces\Python\time_series_predictor\.env\lib\site-packages\sklearn\utils\_pprint.py", line 172 in format
  File "C:\Python37\lib\pprint.py", line 393 in _repr
  File "C:\Python37\lib\pprint.py", line 161 in _format
  File "C:\Python37\lib\pprint.py", line 144 in pformat
  File "c:\Users\Daniel\Workspaces\Python\time_series_predictor\time_series_predictor\sklearn\base.py", line 281 in __repr__
  File "c:\Users\Daniel\Workspaces\Python\time_series_predictor\.env\lib\site-packages\sklearn\utils\_pprint.py", line 437 in _safe_repr
  File "c:\Users\Daniel\Workspaces\Python\time_series_predictor\.env\lib\site-packages\sklearn\utils\_pprint.py", line 172 in format
  File "C:\Python37\lib\pprint.py", line 393 in _repr
  File "C:\Python37\lib\pprint.py", line 161 in _format
  File "C:\Python37\lib\pprint.py", line 144 in pformat
  File "c:\Users\Daniel\Workspaces\Python\time_series_predictor\time_series_predictor\sklearn\base.py", line 281 in __repr__
  File "c:\Users\Daniel\Workspaces\Python\time_series_predictor\.env\lib\site-packages\sklearn\utils\_pprint.py", line 437 in _safe_repr
  File "c:\Users\Daniel\Workspaces\Python\time_series_predictor\.env\lib\site-packages\sklearn\utils\_pprint.py", line 172 in format
  File "C:\Python37\lib\pprint.py", line 393 in _repr
  File "C:\Python37\lib\pprint.py", line 161 in _format
  File "C:\Python37\lib\pprint.py", line 144 in pformat
  File "c:\Users\Daniel\Workspaces\Python\time_series_predictor\time_series_predictor\sklearn\base.py", line 281 in __repr__
  File "c:\Users\Daniel\Workspaces\Python\time_series_predictor\.env\lib\site-packages\sklearn\utils\_pprint.py", line 437 in _safe_repr
  File "c:\Users\Daniel\Workspaces\Python\time_series_predictor\.env\lib\site-packages\sklearn\utils\_pprint.py", line 172 in format
  File "C:\Python37\lib\pprint.py", line 393 in _repr
  File "C:\Python37\lib\pprint.py", line 161 in _format
  File "C:\Python37\lib\pprint.py", line 144 in pformat
  File "c:\Users\Daniel\Workspaces\Python\time_series_predictor\time_series_predictor\sklearn\base.py", line 281 in __repr__
  File "c:\Users\Daniel\Workspaces\Python\time_series_predictor\.env\lib\site-packages\sklearn\utils\_pprint.py", line 437 in _safe_repr
  File "c:\Users\Daniel\Workspaces\Python\time_series_predictor\.env\lib\site-packages\sklearn\utils\_pprint.py", line 172 in format
  File "C:\Python37\lib\pprint.py", line 393 in _repr
  File "C:\Python37\lib\pprint.py", line 161 in _format
  File "C:\Python37\lib\pprint.py", line 144 in pformat
  File "c:\Users\Daniel\Workspaces\Python\time_series_predictor\time_series_predictor\sklearn\base.py", line 281 in __repr__
  File "c:\Users\Daniel\Workspaces\Python\time_series_predictor\.env\lib\site-packages\sklearn\utils\_pprint.py", line 437 in _safe_repr
  File "c:\Users\Daniel\Workspaces\Python\time_series_predictor\.env\lib\site-packages\sklearn\utils\_pprint.py", line 172 in format

I am using ray version 1.0.1.post1
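The current-thread frames above repeat one cycle: `__repr__` in `time_series_predictor\sklearn\base.py` calls `pprint.pformat`, which goes through scikit-learn's `_safe_repr` and back into `__repr__`. That pattern is consistent with a `__repr__` that recurses without bound. The sketch below reproduces the general mechanism with the standard library only and shows `reprlib.recursive_repr` as one possible guard; `NaiveModel`, `GuardedModel`, and `repr_recurses` are hypothetical and are not taken from time_series_predictor or scikit-learn.

```python
import pprint
import reprlib


class NaiveModel:
    """Hypothetical class whose __repr__ delegates to pprint on itself."""

    def __repr__(self):
        # pprint falls back to repr(obj) for types it does not know,
        # which calls this method again: unbounded recursion.
        return pprint.pformat(self)


class GuardedModel:
    """Same delegation, but re-entrant calls on the same object are cut off."""

    @reprlib.recursive_repr(fillvalue="<GuardedModel ...>")
    def __repr__(self):
        # On re-entry the decorator returns the fill value instead of
        # running this body a second time.
        return pprint.pformat(self)


def repr_recurses(obj):
    """Return True if repr(obj) exhausts Python's recursion limit."""
    try:
        repr(obj)
    except RecursionError:
        return True
    return False
```

In plain Python the unguarded version raises `RecursionError`; under a debugger on Windows, as in the trace above, the same loop could plausibly surface as a fatal stack overflow instead.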

Yard1 commented 3 years ago

That's quite odd. @richardliaw, @inventormc any ideas?

You can revert to the previous version by uninstalling tune-sklearn and installing it normally again.

DanielAtKrypton commented 3 years ago

I reinstalled tune-sklearn, which gave me version tune-sklearn-0.1.0. The behaviour is now the same as the one I originally reported here.

richardliaw commented 3 years ago

Hey @DanielAtKrypton, what are the commands to reproduce your stack?

Also, can you try running this outside vscode (i.e., just using a terminal)?

DanielAtKrypton commented 3 years ago

> Hey @DanielAtKrypton, what are the commands to reproduce your stack?
>
> Also, can you try running this outside vscode (i.e., just using a terminal)?

Sure, I just started the test:


I will leave it processing for now...

richardliaw commented 3 years ago

can you try instead with pytest -s -v ?

DanielAtKrypton commented 3 years ago

> can you try instead with pytest -s -v ?

Sure. Here is the output: Imgur

richardliaw commented 3 years ago

OK got it; can you now try, in a python terminal:

import ray
def hello_world():
    return "hi"
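A complete hello-world check in the spirit of the snippet above could look like the following sketch. It is an assumption about the intended test, not the original code. It relies on `ray.init()`, the `@ray.remote` decorator, `.remote()`, `ray.get()`, and `ray.shutdown()`; the wrapper `ray_hello_world` is hypothetical and imports Ray lazily so the file can be loaded even where Ray is missing.

```python
def ray_hello_world():
    """Start a local Ray instance, run one remote task, and return its result."""
    import ray  # lazy import: loading this file does not require Ray

    ray.init()  # the call reported as hanging later in this thread

    @ray.remote
    def hello_world():
        return "hi"

    try:
        # .remote() schedules the task; ray.get() blocks until it finishes.
        return ray.get(hello_world.remote())
    finally:
        # Release the local Ray instance whether or not the task succeeded.
        ray.shutdown()
```

On a healthy installation, calling `ray_hello_world()` should return within a few seconds; if it never returns, the problem lies in Ray startup or scheduling rather than in tune-sklearn.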

DanielAtKrypton commented 3 years ago

There you go:


richardliaw commented 3 years ago

awesome, so now we know that the fundamental problem seems to be in ray core.

Can you try ray stop and run it again?

DanielAtKrypton commented 3 years ago

Still running. I will update as soon as I have other output from the terminal...


richardliaw commented 3 years ago

OK got it, so ray.init() is just hanging forever?

DanielAtKrypton commented 3 years ago

> OK got it, so ray.init() is just hanging forever?

Yes, unfortunately it is.
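One way to confirm a hang like this without blocking the terminal is to run the suspect call in a child interpreter with a time limit. The sketch below uses only the standard library; the helper name `probe` and its return labels are hypothetical. Output is discarded rather than captured so that background processes started by the child cannot keep a pipe open after the timeout.

```python
import subprocess
import sys


def probe(code, timeout_s):
    """Run `code` in a fresh interpreter; return 'ok', 'error', or 'timeout'."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed once this elapses
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if result.returncode == 0 else "error"

# Example for the case in this thread (the 120-second limit is arbitrary):
#   probe("import ray; ray.init()", timeout_s=120)
```

A result of "timeout" would confirm that `ray.init()` never returns on its own. Ray may leave background processes running after the child is killed, so running ray stop afterwards is reasonable.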

richardliaw commented 3 years ago

OK. Can you try:

pip install -U [latest wheel link for windows] as found here:


and if that doesn't work, try downgrading with pip install ray==1.0.0?

DanielAtKrypton commented 3 years ago

I installed the latest wheel for Windows and Python 3.7.

Now it is behaving like this:


I tried to open the dashboard in my browser but the browser was unable to connect there.

DanielAtKrypton commented 3 years ago

And ray status reports:


richardliaw commented 3 years ago

try ray stop a couple times, then try the hello world again?

DanielAtKrypton commented 3 years ago

> try ray stop a couple times, then try the hello world again?

I tried a couple times. It starts and hangs forever...


richardliaw commented 3 years ago

Unfortunately this is a ray issue, and I'll close this and continue discussion on the ray side.

DanielAtKrypton commented 3 years ago

The source of this problem is being discussed here.