ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Bug] Tune crashes with "RuntimeError: Trial#8161 has already finished and can not be updated" in ray 1.7.0 #19274

Closed jmakov closed 2 years ago

jmakov commented 2 years ago

Ray Component

Ray Tune

What happened + What you expected to happen

After upgrading from Ray 1.6.0 to 1.7.0, my script exits with an exception. On 1.6.0 it only emitted warnings.

Reproduction script

The script sets these environment variables:

os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"  # https://github.com/ray-project/ray/issues/18903
os.environ["TUNE_DISABLE_AUTO_CALLBACK_SYNCER"] = "1"  # https://github.com/ray-project/ray/issues/18903
os.environ["TUNE_RESULT_BUFFER_LENGTH"] = "0"  # if 0 report trial result immediately so that trials don't run speculatively

Warnings and exception from the script:

2021-10-09 19:49:42,867 WARNING ray_trial_executor.py:772 -- Over the last 60 seconds, the Tune event loop has been backlogged processing new results. Consider increasing your period of result reporting to improve performance.                                                                                           
2021-10-09 19:49:43,407 WARNING util.py:166 -- The `on_step_end` operation took 0.535 s, which may be a performance bottleneck.                                                                                                                                                                                              
2021-10-09 19:50:01,586 WARNING util.py:166 -- The `on_step_end` operation took 0.530 s, which may be a performance bottleneck.                                                                                                                                                                                              
2021-10-09 19:50:07,729 WARNING util.py:166 -- The `on_step_end` operation took 0.527 s, which may be a performance bottleneck.                                                                                                                                                                                              
2021-10-09 19:50:13,967 WARNING util.py:166 -- The `on_step_end` operation took 0.530 s, which may be a performance bottleneck.                                                                                                                                                                                              
2021-10-09 19:50:20,777 WARNING util.py:166 -- The `on_step_end` operation took 0.541 s, which may be a performance bottleneck.                                                                                                                                                                                              
2021-10-09 19:50:33,234 WARNING util.py:166 -- The `on_step_end` operation took 0.545 s, which may be a performance bottleneck.                                                                                                                                                                                              
2021-10-09 19:50:43,765 WARNING ray_trial_executor.py:772 -- Over the last 60 seconds, the Tune event loop has been backlogged processing new results. Consider increasing your period of result reporting to improve performance.                                                                                           
2021-10-09 19:50:51,181 WARNING util.py:166 -- The `on_step_end` operation took 0.552 s, which may be a performance bottleneck.                                                                                                                                                                                              
2021-10-09 19:50:57,041 WARNING util.py:166 -- The `on_step_end` operation took 0.544 s, which may be a performance bottleneck.                                                                                                                                                                                              
2021-10-09 19:51:03,231 WARNING util.py:166 -- The `on_step_end` operation took 0.551 s, which may be a performance bottleneck.                                                                                                                                                                                              
2021-10-09 19:51:09,011 WARNING util.py:166 -- The `on_step_end` operation took 0.538 s, which may be a performance bottleneck.                                                                                                                                                                                              
2021-10-09 19:51:14,703 WARNING util.py:166 -- The `on_step_end` operation took 0.565 s, which may be a performance bottleneck.                                                                                                                                                                                              
2021-10-09 19:51:37,931 WARNING util.py:166 -- The `on_step_end` operation took 0.533 s, which may be a performance bottleneck.                                                                                                                                                                                              
2021-10-09 19:51:45,019 WARNING ray_trial_executor.py:772 -- Over the last 60 seconds, the Tune event loop has been backlogged processing new results. Consider increasing your period of result reporting to improve performance.                                                                                           
/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/trial/_trial.py:592: UserWarning: The reported value is ignored because this `step` 1 is already reported.
/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/trial/_trial.py:592: UserWarning: The reported value is ignored because this `step` 1 is already reported.
  step
2021-10-09 19:51:57,036 ERROR trial_runner.py:924 -- Trial trainable_71ecb376: Error processing event.
Traceback (most recent call last):
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 898, in _process_trial
    decision = self._process_trial_result(trial, result)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 951, in _process_trial_result
    trial.trial_id, result=flat_result)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/search_generator.py", line 132, in on_trial_complete
    trial_id=trial_id, result=result, error=error)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/optuna.py", line 400, in on_trial_complete
    self._ot_study.tell(ot_trial, val, state=ot_trial_state)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/study/study.py", line 662, in tell
    self._storage.set_trial_values(trial_id, values)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_in_memory.py", line 330, in set_trial_values
    self.check_trial_is_updatable(trial_id, trial.state)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_base.py", line 723, in check_trial_is_updatable
    "Trial#{} has already finished and can not be updated.".format(trial.number)
RuntimeError: Trial#8161 has already finished and can not be updated.
Traceback (most recent call last):
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 898, in _process_trial
    decision = self._process_trial_result(trial, result)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 951, in _process_trial_result
    trial.trial_id, result=flat_result)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/search_generator.py", line 132, in on_trial_complete
    trial_id=trial_id, result=result, error=error)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/optuna.py", line 400, in on_trial_complete
    self._ot_study.tell(ot_trial, val, state=ot_trial_state)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/study/study.py", line 662, in tell
    self._storage.set_trial_values(trial_id, values)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_in_memory.py", line 330, in set_trial_values
    self.check_trial_is_updatable(trial_id, trial.state)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_base.py", line 723, in check_trial_is_updatable
    "Trial#{} has already finished and can not be updated.".format(trial.number)
RuntimeError: Trial#8161 has already finished and can not be updated.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "notebooks/factor/price_prediction.py", line 168, in <module>
    reuse_actors=True
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/tune.py", line 581, in run
    runner.step()
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 705, in step
    self._process_events(timeout=timeout)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 863, in _process_events
    self._process_trial(trial)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 925, in _process_trial
    self._process_trial_failure(trial, traceback.format_exc())
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 1139, in _process_trial_failure
    self._search_alg.on_trial_complete(trial.trial_id, error=True)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/search_generator.py", line 132, in on_trial_complete
    trial_id=trial_id, result=result, error=error)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/optuna.py", line 400, in on_trial_complete
    self._ot_study.tell(ot_trial, val, state=ot_trial_state)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/study/study.py", line 664, in tell
    self._storage.set_trial_state(trial_id, state)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_in_memory.py", line 223, in set_trial_state
    self.check_trial_is_updatable(trial_id, trial.state)
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_base.py", line 723, in check_trial_is_updatable
    "Trial#{} has already finished and can not be updated.".format(trial.number)
RuntimeError: Trial#8161 has already finished and can not be updated.
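
Reading the traceback, Optuna's storage rejects an update to a trial it already considers finished: first when `_process_trial_result` calls `on_trial_complete`, and again when `_process_trial_failure` retries `on_trial_complete` with `error=True`. The following is a toy model of that check and of a guard against a second completion report. It assumes nothing about the real internals; `ToyStorage` and `complete_once` are hypothetical names, not Optuna or Ray Tune APIs.

```python
# Toy model only: NOT Optuna's or Ray Tune's real code. All names are hypothetical.


class ToyStorage:
    """Stores one value per trial and refuses to update a finished trial."""

    def __init__(self):
        self._finished = {}  # trial number -> final value

    def tell(self, number, value):
        if number in self._finished:
            # Same message shape as the RuntimeError in the traceback above.
            raise RuntimeError(
                "Trial#{} has already finished and can not be updated.".format(number)
            )
        self._finished[number] = value

    def is_finished(self, number):
        return number in self._finished


def complete_once(storage, number, value):
    """Report a result only if the trial is still open; return True if stored."""
    if storage.is_finished(number):
        return False
    storage.tell(number, value)
    return True


storage = ToyStorage()
storage.tell(8161, 0.5)  # the first completion is accepted

second_error = None
try:
    storage.tell(8161, 0.7)  # a second completion for the same trial is rejected
except RuntimeError as err:
    second_error = str(err)

stored_again = complete_once(storage, 8161, 0.7)  # the guarded call is a no-op
```

Whether a guard of this kind is the right fix for Ray Tune is not established here; the sketch only illustrates why a second completion report for the same trial fails.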

Anything else

Conda's env.yaml:

name: puma-lab
channels:
  - pyviz
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=1_gnu
  - abseil-cpp=20210324.2=h9c3ff4c_0
  - alembic=1.7.3=pyhd8ed1ab_0
  - alsa-lib=1.2.3=h516909a_0
  - anyio=3.3.0=py37h89c1867_0
  - argcomplete=1.12.3=pyhd8ed1ab_2
  - argon2-cffi=20.1.0=py37h5e8e339_2
  - arrow-cpp=5.0.0=py37hdf48254_5_cpu
  - async_generator=1.10=py_0
  - attrs=21.2.0=pyhd8ed1ab_0
  - autopage=0.4.0=pyhd8ed1ab_0
  - aws-c-cal=0.5.11=h95a6274_0
  - aws-c-common=0.6.2=h7f98852_0
  - aws-c-event-stream=0.2.7=h3541f99_13
  - aws-c-io=0.10.5=hfb6a706_0
  - aws-checksums=0.1.11=ha31a3da_7
  - aws-sdk-cpp=1.8.186=hb4091e7_3
  - babel=2.9.1=pyh44b312d_0
  - backcall=0.2.0=pyh9f0ad1d_0
  - backports=1.0=py_2
  - backports.functools_lru_cache=1.6.4=pyhd8ed1ab_0
  - backports.zoneinfo=0.2.1=py37h5e8e339_4
  - bleach=4.1.0=pyhd8ed1ab_0
  - bokeh=2.3.3=py37h89c1867_0
  - brotlipy=0.7.0=py37h5e8e339_1001
  - bzip2=1.0.8=h7f98852_4
  - c-ares=1.17.2=h7f98852_0
  - ca-certificates=2021.5.30=ha878542_0
  - certifi=2021.5.30=py37h89c1867_0
  - cffi=1.14.6=py37hc58025e_0
  - chardet=4.0.0=py37h89c1867_1
  - charset-normalizer=2.0.0=pyhd8ed1ab_0
  - click=8.0.1=py37h89c1867_0
  - clickhouse-cityhash=1.0.2.3=py37h3340039_2
  - clickhouse-driver=0.2.1=py37h5e8e339_0
  - cliff=3.9.0=pyhd8ed1ab_0
  - cloudpickle=2.0.0=pyhd8ed1ab_0
  - cmaes=0.8.2=pyh44b312d_0
  - cmd2=2.2.0=py37h89c1867_0
  - colorama=0.4.4=pyh9f0ad1d_0
  - colorcet=2.0.6=pyhd8ed1ab_0
  - colorlog=6.4.1=py37h89c1867_0
  - conda=4.10.3=py37h89c1867_1
  - conda-package-handling=1.7.3=py37h5e8e339_0
  - cramjam=2.3.1=py37h5e8e339_1
  - cryptography=3.4.7=py37h5d9358c_0
  - cycler=0.10.0=py_2
  - cytoolz=0.11.0=py37h5e8e339_3
  - dask=2021.9.0=pyhd8ed1ab_0
  - dask-core=2021.9.0=pyhd8ed1ab_0
  - datashader=0.13.0=pyh6c4a22f_0
  - datashape=0.5.4=py_1
  - dbus=1.13.6=h48d8840_2
  - debugpy=1.4.1=py37hcd2ae1e_0
  - decorator=5.1.0=pyhd8ed1ab_0
  - defusedxml=0.7.1=pyhd8ed1ab_0
  - distributed=2021.9.0=py37h89c1867_0
  - entrypoints=0.3=py37hc8dfbb8_1002
  - expat=2.4.1=h9c3ff4c_0
  - fastparquet=0.7.1=py37hb1e94ed_0
  - filelock=3.0.12=pyh9f0ad1d_0
  - fontconfig=2.13.1=hba837de_1005
  - freetype=2.10.4=h0708190_1
  - fsspec=2021.8.1=pyhd8ed1ab_0
  - gettext=0.19.8.1=h0b5b191_1005
  - gflags=2.2.2=he1b5a44_1004
  - gitdb=4.0.7=pyhd8ed1ab_0
  - gitpython=3.1.23=pyhd8ed1ab_1
  - glib=2.68.4=h9c3ff4c_0
  - glib-tools=2.68.4=h9c3ff4c_0
  - glog=0.5.0=h48cff8f_0
  - greenlet=1.1.1=py37hcd2ae1e_0
  - grpc-cpp=1.40.0=h850795e_0
  - gst-plugins-base=1.18.5=hf529b03_0
  - gstreamer=1.18.5=h76c114f_0
  - heapdict=1.0.1=py_0
  - holoviews=1.14.5=py_0
  - hvplot=0.7.3=py_0
  - icu=68.1=h58526e2_0
  - idna=3.1=pyhd3deb0d_0
  - importlib-metadata=4.8.1=py37h89c1867_0
  - importlib_metadata=4.8.1=hd8ed1ab_0
  - importlib_resources=5.2.2=pyhd8ed1ab_0
  - ipykernel=6.4.1=py37h6531663_0
  - ipympl=0.7.0=pyhd8ed1ab_0
  - ipython=7.27.0=py37h6531663_0
  - ipython_genutils=0.2.0=py_1
  - ipywidgets=7.6.5=pyhd8ed1ab_0
  - jbig=2.1=h7f98852_2003
  - jedi=0.18.0=py37h89c1867_2
  - jinja2=3.0.1=pyhd8ed1ab_0
  - joblib=1.0.1=pyhd8ed1ab_0
  - jpeg=9d=h36c2ea0_0
  - json5=0.9.5=pyh9f0ad1d_0
  - jsonschema=3.2.0=py37hc8dfbb8_1
  - jupyter-server-mathjax=0.2.3=pyhd8ed1ab_0
  - jupyter_client=7.0.2=pyhd8ed1ab_0
  - jupyter_contrib_core=0.3.3=py_2
  - jupyter_contrib_nbextensions=0.5.1=py37hc8dfbb8_1
  - jupyter_core=4.7.1=py37h89c1867_0
  - jupyter_highlight_selected_word=0.2.0=py37h89c1867_1002
  - jupyter_latex_envs=1.4.6=py37h89c1867_1001
  - jupyter_nbextensions_configurator=0.4.1=py37h89c1867_2
  - jupyter_server=1.11.0=pyhd8ed1ab_0
  - jupyterlab=3.1.11=pyhd8ed1ab_0
  - jupyterlab-git=0.32.2=pyhd8ed1ab_0
  - jupyterlab_pygments=0.1.2=pyh9f0ad1d_0
  - jupyterlab_server=2.8.1=pyhd8ed1ab_0
  - jupyterlab_widgets=1.0.2=pyhd8ed1ab_0
  - kiwisolver=1.3.2=py37h2527ec5_0
  - krb5=1.19.2=hcc1bbae_0
  - lcms2=2.12=hddcbb42_0
  - ld_impl_linux-64=2.36.1=hea4e1c9_2
  - lerc=2.2.1=h9c3ff4c_0
  - libarchive=3.5.2=hccf745f_0
  - libblas=3.9.0=11_linux64_openblas
  - libbrotlicommon=1.0.9=h7f98852_5
  - libbrotlidec=1.0.9=h7f98852_5
  - libbrotlienc=1.0.9=h7f98852_5
  - libcblas=3.9.0=11_linux64_openblas
  - libclang=11.1.0=default_ha53f305_1
  - libcurl=7.78.0=h2574ce0_0
  - libdeflate=1.7=h7f98852_5
  - libedit=3.1.20191231=he28a2e2_2
  - libev=4.33=h516909a_1
  - libevent=2.1.10=hcdb4288_3
  - libffi=3.3=h58526e2_2
  - libgcc-ng=11.1.0=hc902ee8_8
  - libgfortran-ng=11.1.0=h69a702a_8
  - libgfortran5=11.1.0=h6c583b3_8
  - libglib=2.68.4=h3e27bee_0
  - libgomp=11.1.0=hc902ee8_8
  - libiconv=1.16=h516909a_0
  - liblapack=3.9.0=11_linux64_openblas
  - libllvm11=11.1.0=hf817b99_2
  - libnghttp2=1.43.0=h812cca2_0
  - libogg=1.3.4=h7f98852_1
  - libopenblas=0.3.17=pthreads_h8fe5266_1
  - libopus=1.3.1=h7f98852_1
  - libpng=1.6.37=h21135ba_2
  - libpq=13.3=hd57d9b9_0
  - libprotobuf=3.16.0=h780b84a_0
  - libsodium=1.0.18=h36c2ea0_1
  - libsolv=0.7.19=h780b84a_5
  - libssh2=1.10.0=ha56f1ee_0
  - libstdcxx-ng=11.1.0=h56837e0_8
  - libta-lib=0.4.0=h516909a_0
  - libthrift=0.14.2=he6d91bd_1
  - libtiff=4.3.0=hf544144_1
  - libutf8proc=2.6.1=h7f98852_0
  - libuuid=2.32.1=h7f98852_1000
  - libuv=1.42.0=h7f98852_0
  - libvorbis=1.3.7=h9c3ff4c_0
  - libwebp-base=1.2.1=h7f98852_0
  - libxcb=1.13=h7f98852_1003
  - libxkbcommon=1.0.3=he3ba5ed_0
  - libxml2=2.9.12=h72842e0_0
  - libxslt=1.1.33=h15afd5d_2
  - llvmlite=0.37.0=py37h9d7f4d0_0
  - locket=0.2.0=py_2
  - lxml=4.6.3=py37h77fd288_0
  - lz4-c=1.9.3=h9c3ff4c_1
  - lzo=2.10=h516909a_1000
  - mako=1.1.5=pyhd8ed1ab_0
  - mamba=0.15.3=py37h7f483ca_0
  - markdown=3.3.4=pyhd8ed1ab_0
  - markupsafe=2.0.1=py37h5e8e339_0
  - matplotlib=3.4.3=py37h89c1867_0
  - matplotlib-base=3.4.3=py37h1058ff1_0
  - matplotlib-inline=0.1.3=pyhd8ed1ab_0
  - mistune=0.8.4=py37h5e8e339_1004
  - modin-core=0.10.2=py37h89c1867_3
  - modin-ray=0.10.2=py37h89c1867_3
  - msgpack-python=1.0.2=py37h2527ec5_1
  - multipledispatch=0.6.0=py_0
  - mysql-common=8.0.25=ha770c72_2
  - mysql-libs=8.0.25=hfa10184_2
  - nb_conda_kernels=2.3.1=py37h89c1867_0
  - nbclassic=0.3.1=pyhd8ed1ab_1
  - nbclient=0.5.4=pyhd8ed1ab_0
  - nbconvert=6.1.0=py37h89c1867_0
  - nbdime=3.1.0=pyhd8ed1ab_0
  - nbformat=5.1.3=pyhd8ed1ab_0
  - ncurses=6.2=h58526e2_4
  - nest-asyncio=1.5.1=pyhd8ed1ab_0
  - notebook=6.4.3=pyha770c72_0
  - nspr=4.30=h9c3ff4c_0
  - nss=3.69=hb5efdd6_0
  - numba=0.54.0=py37h2d894fd_0
  - numpy=1.20.3=py37h038b26d_1
  - olefile=0.46=pyh9f0ad1d_1
  - openjpeg=2.4.0=hb52868f_1
  - openssl=1.1.1l=h7f98852_0
  - optuna=2.9.1=pyhd8ed1ab_0
  - orc=1.6.10=h58a87f1_0
  - packaging=21.0=pyhd8ed1ab_0
  - pandas=1.3.2=py37he8f5f7f_0
  - pandoc=2.14.2=h7f98852_0
  - pandocfilters=1.4.2=py_1
  - panel=0.12.1=py_0
  - param=1.11.1=pyh6c4a22f_0
  - parquet-cpp=1.5.1=1
  - parso=0.8.2=pyhd8ed1ab_0
  - partd=1.2.0=pyhd8ed1ab_0
  - patsy=0.5.2=pyhd8ed1ab_0
  - pbr=5.6.0=pyhd8ed1ab_0
  - pcre=8.45=h9c3ff4c_0
  - pexpect=4.8.0=py37hc8dfbb8_1
  - pickle5=0.0.11=py37h5e8e339_0
  - pickleshare=0.7.5=py37hc8dfbb8_1002
  - pillow=8.3.2=py37h0f21c89_0
  - pip=21.2.4=pyhd8ed1ab_0
  - prettytable=2.2.0=pyhd8ed1ab_0
  - prometheus_client=0.11.0=pyhd8ed1ab_0
  - prompt-toolkit=3.0.20=pyha770c72_0
  - psutil=5.8.0=py37h5e8e339_1
  - pthread-stubs=0.4=h36c2ea0_1001
  - ptyprocess=0.7.0=pyhd3deb0d_0
  - pyarrow=5.0.0=py37h58331f5_5_cpu
  - pycosat=0.6.3=py37h5e8e339_1006
  - pycparser=2.20=pyh9f0ad1d_2
  - pyct=0.4.6=py_0
  - pyct-core=0.4.6=py_0
  - pygments=2.10.0=pyhd8ed1ab_0
  - pykalman=0.9.5=py_1
  - pyopenssl=20.0.1=pyhd8ed1ab_0
  - pyparsing=2.4.7=pyh9f0ad1d_0
  - pyperclip=1.8.2=pyhd8ed1ab_2
  - pyqt=5.12.3=py37h89c1867_7
  - pyqt-impl=5.12.3=py37he336c9b_7
  - pyqt5-sip=4.19.18=py37hcd2ae1e_7
  - pyqtchart=5.12=py37he336c9b_7
  - pyqtwebengine=5.12.1=py37he336c9b_7
  - pyrsistent=0.17.3=py37h5e8e339_2
  - pysocks=1.7.1=py37h89c1867_3
  - python=3.7.10=hffdb5ce_100_cpython
  - python-dateutil=2.8.2=pyhd8ed1ab_0
  - python_abi=3.7=2_cp37m
  - pytz=2021.1=pyhd8ed1ab_0
  - pyviz_comms=2.1.0=py_0
  - pyyaml=5.4.1=py37h5e8e339_1
  - pyzmq=22.2.1=py37h336d617_0
  - qt=5.12.9=hda022c4_4
  - re2=2021.09.01=h9c3ff4c_0
  - readline=8.1=h46c0cb4_0
  - redis-py=3.5.3=pyh9f0ad1d_0
  - reproc=14.2.3=h7f98852_0
  - reproc-cpp=14.2.3=h9c3ff4c_0
  - requests=2.26.0=pyhd8ed1ab_0
  - requests-unixsocket=0.2.0=py_0
  - ruamel_yaml=0.15.80=py37h5e8e339_1004
  - s2n=1.0.10=h9b69904_0
  - scikit-learn=0.24.2=py37hf0f1638_1
  - send2trash=1.8.0=pyhd8ed1ab_0
  - setproctitle=1.1.10=py37h5e8e339_1004
  - setuptools=58.0.4=py37h89c1867_0
  - six=1.16.0=pyh6c4a22f_0
  - smmap=3.0.5=pyh44b312d_0
  - snappy=1.1.8=he1b5a44_3
  - sniffio=1.2.0=py37h89c1867_1
  - sortedcontainers=2.4.0=pyhd8ed1ab_0
  - sqlalchemy=1.4.25=py37h5e8e339_0
  - sqlite=3.36.0=h9cd32fc_1
  - statsmodels=0.12.2=py37hb1e94ed_0
  - stevedore=3.4.0=py37h89c1867_0
  - ta-lib=0.4.19=py37ha21ca33_2
  - tabulate=0.8.9=pyhd8ed1ab_0
  - tblib=1.7.0=pyhd8ed1ab_0
  - tensorboardx=2.4=pyhd8ed1ab_0
  - terminado=0.12.1=py37h89c1867_0
  - testpath=0.5.0=pyhd8ed1ab_0
  - threadpoolctl=2.2.0=pyh8a188c0_0
  - thrift=0.13.0=py37hcd2ae1e_2
  - tk=8.6.11=h27826a3_1
  - toolz=0.11.1=py_0
  - tornado=6.1=py37h5e8e339_1
  - tqdm=4.62.2=pyhd8ed1ab_0
  - traitlets=5.1.0=pyhd8ed1ab_0
  - typing_extensions=3.10.0.0=pyha770c72_0
  - tzdata=2021a=he74cb21_1
  - tzlocal=3.0=py37h89c1867_2
  - urllib3=1.26.6=pyhd8ed1ab_0
  - wcwidth=0.2.5=pyh9f0ad1d_2
  - webencodings=0.5.1=py_1
  - websocket-client=0.57.0=py37h89c1867_4
  - wheel=0.37.0=pyhd8ed1ab_1
  - widgetsnbextension=3.5.1=py37h89c1867_4
  - xarray=0.19.0=pyhd8ed1ab_1
  - xeus=2.0.0=h7d0c39e_0
  - xeus-python=0.13.0=py37h4b46df4_1
  - xeus-python-shell=0.1.5=pyhd8ed1ab_0
  - xorg-libxau=1.0.9=h7f98852_0
  - xorg-libxdmcp=1.1.3=h7f98852_0
  - xz=5.2.5=h516909a_1
  - yaml=0.2.5=h516909a_0
  - zeromq=4.3.4=h9c3ff4c_1
  - zict=2.0.0=py_0
  - zipp=3.5.0=pyhd8ed1ab_0
  - zlib=1.2.11=h516909a_1010
  - zstandard=0.15.2=py37h5e8e339_0
  - zstd=1.5.0=ha95c52a_0
  - pip:
    - absl-py==0.13.0
    - aiohttp==3.7.4.post0
    - aiohttp-cors==0.7.0
    - aioredis==1.3.1
    - async-timeout==3.0.1
    - autograd==1.3
    - bayesian-optimization==1.2.0
    - blessings==1.7
    - cachetools==4.2.2
    - cma==2.7.0
    - colorful==0.5.4
    - cython==0.29.24
    - future==0.18.2
    - google-api-core==1.31.2
    - google-auth==1.35.0
    - google-auth-oauthlib==0.4.6
    - googleapis-common-protos==1.53.0
    - gpustat==0.6.0
    - gpy==1.10.0
    - gpytorch==1.5.1
    - grpcio==1.40.0
    - hebo==0.1.0
    - hiredis==2.0.0
    - multidict==5.1.0
    - nevergrad==0.4.3.post8
    - nvidia-ml-py3==7.352.0
    - oauthlib==3.1.1
    - opencensus==0.7.13
    - opencensus-context==0.1.2
    - paramz==0.9.5
    - protobuf==3.17.3
    - py-spy==0.3.9
    - pyasn1==0.4.8
    - pyasn1-modules==0.2.8
    - pymoo==0.4.2.2
    - ray==1.7.0
    - requests-oauthlib==1.3.0
    - rsa==4.7.2
    - scipy==1.5.4
    - sklearn==0.0
    - tensorboard==2.6.0
    - tensorboard-data-server==0.6.1
    - tensorboard-plugin-wit==1.8.0
    - torch==1.9.1
    - werkzeug==2.0.1
    - yarl==1.6.3

jmakov commented 2 years ago

It doesn't crash if the line os.environ["TUNE_RESULT_BUFFER_LENGTH"] = "0" is commented out.
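
A minimal sketch of that workaround, assuming the other two variables are still wanted: set them as before and leave `TUNE_RESULT_BUFFER_LENGTH` unset so Tune keeps its default buffering. The helper name `apply_tune_env` is hypothetical and only illustrates the configuration.

```python
import os


def apply_tune_env(env):
    """Set the two callback switches from the report; drop the buffer override."""
    env["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"
    env["TUNE_DISABLE_AUTO_CALLBACK_SYNCER"] = "1"
    # Workaround from this comment: do not force the buffer length to 0.
    env.pop("TUNE_RESULT_BUFFER_LENGTH", None)
    return env


# Called before `import ray`, mirroring the order used in the original script.
apply_tune_env(os.environ)
```

This avoids the crash in the reporter's setup, at the cost of losing the immediate, unbuffered result reporting that the `"0"` setting was meant to provide.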

krfricke commented 2 years ago

Do you have a script we can try ourselves? Otherwise it's hard to fix this.

jmakov commented 2 years ago

@krfricke here you go:

import os
import random

os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"  # https://github.com/ray-project/ray/issues/18903
os.environ["TUNE_DISABLE_AUTO_CALLBACK_SYNCER"] = "1"  # https://github.com/ray-project/ray/issues/18903
os.environ["TUNE_RESULT_BUFFER_LENGTH"] = "0"  # if 0 report trial result immediately so that trials don't run speculatively

import numpy as np
import ray
from ray import tune
from ray.tune.suggest import optuna

def evaluation_fn():
    return random.randint(1, 10_000)

def easy_objective(config, data):
    intermediate_score = evaluation_fn()
    tune.report(mean_loss=intermediate_score)

if __name__ == "__main__":
    ray.init(address='auto', _redis_password='xxx')
    df = np.zeros(10_000_000)
    search_optuna = optuna.OptunaSearch()
    analysis = tune.run(
        tune.with_parameters(easy_objective, data=df),
        name="test",
        metric="mean_loss",
        mode="max",
        search_alg=search_optuna,
        num_samples=-1,
        config={
            "width": tune.uniform(0, 20),
            "height": tune.uniform(-100, 100)
        },
        reuse_actors=True,
        fail_fast=True,
        verbose=1
    )

On my 3-node cluster it crashes with:

== Status ==                                                                   
Memory usage on this node: 15.1/31.3 GiB                                                                                                                      
Using FIFO scheduling algorithm.                                               
Resources requested: 51.0/52 CPUs, 0/2 GPUs, 0.0/102.14 GiB heap, 0.0/47.77 GiB objects (0.0/1.0 accelerator_type:GT, 0.0/1.0 accelerator_type:G)                                                                                                                                                                            
Current best trial: a485397c with mean_loss=10000 and parameters={'width': 7.240691732613056, 'height': -58.746246442990405}                                                                                                                                                                                                 
Result logdir: /home/toaster/ray_results/test                                                                                                                 
Number of trials: 27650/infinite (1 PENDING, 51 RUNNING, 27598 TERMINATED)                                                                                    

2021-10-14 20:42:07,566 ERROR trial_runner.py:846 -- Trial easy_objective_57d58564: Error processing event.                                                                                                                                                                                                                  
Traceback (most recent call last):                                             
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 820, in _process_trial                                                                                                                                                                                      
    decision = self._process_trial_result(trial, result)                                                                                                      
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 873, in _process_trial_result                                                                                                                                                                               
    trial.trial_id, result=flat_result)                                        
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/search_generator.py", line 132, in on_trial_complete                                                                                                                                                                       
    trial_id=trial_id, result=result, error=error)                                                                                                            
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/optuna.py", line 385, in on_trial_complete                                                                                                                                                                                 
    self._ot_study.tell(ot_trial, val, state=ot_trial_state)                                                                                                  
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/study/study.py", line 662, in tell                                                                                                                                                                                                   
    self._storage.set_trial_values(trial_id, values)                                                                                                          
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_in_memory.py", line 330, in set_trial_values                                                                                                                                                                               
    self.check_trial_is_updatable(trial_id, trial.state)                                                                                                      
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_base.py", line 723, in check_trial_is_updatable                                                                                                                                                                            
    "Trial#{} has already finished and can not be updated.".format(trial.number)                                                                              
RuntimeError: Trial#5562 has already finished and can not be updated.                                                                                         
Traceback (most recent call last):                                             
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 820, in _process_trial                                                                                                                                                                                      
    decision = self._process_trial_result(trial, result)                                                                                                      
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 873, in _process_trial_result                                                                                                                                                                               
    trial.trial_id, result=flat_result)                                        
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/search_generator.py", line 132, in on_trial_complete                                                                                                                                                                       
    trial_id=trial_id, result=result, error=error)                                                                                                            
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/optuna.py", line 385, in on_trial_complete                                                                                                                                                                                 
    self._ot_study.tell(ot_trial, val, state=ot_trial_state)                                                                                                  
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/study/study.py", line 662, in tell                                                                                                                                                                                                   
    self._storage.set_trial_values(trial_id, values)                                                                                                          
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_in_memory.py", line 330, in set_trial_values                                                                                                                                                                               
    self.check_trial_is_updatable(trial_id, trial.state)                                                                                                      
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_base.py", line 723, in check_trial_is_updatable                                                                                                                                                                            
    "Trial#{} has already finished and can not be updated.".format(trial.number)                                                                              
RuntimeError: Trial#5562 has already finished and can not be updated.                                                                                         

During handling of the above exception, another exception occurred:                                                                                           

Traceback (most recent call last):                                             
  File "test.py", line 44, in <module>                                         
    verbose=1                                                                  
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/tune.py", line 588, in run                                                                                                                                                                                                         
    runner.step()                                                              
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 627, in step                                                                                                                                                                                                
    self._process_events(timeout=timeout)                                                                                                                     
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 785, in _process_events                                                                                                                                                                                     
    self._process_trial(trial)                                                 
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 847, in _process_trial                                                                                                                                                                                      
    self._process_trial_failure(trial, traceback.format_exc())                                                                                                
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 1058, in _process_trial_failure                                                                                                                                                                             
    self._search_alg.on_trial_complete(trial.trial_id, error=True)                                                                                            
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/search_generator.py", line 132, in on_trial_complete                                                                                                                                                                       
    trial_id=trial_id, result=result, error=error)                                                                                                            
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/optuna.py", line 385, in on_trial_complete                                                                                                                                                                                 
    self._ot_study.tell(ot_trial, val, state=ot_trial_state)                                                                                                  
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/study/study.py", line 664, in tell                                                                                                                                                                                                   
    self._storage.set_trial_state(trial_id, state)                                                                                                            
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_in_memory.py", line 223, in set_trial_state                                                                                                                                                                                
    self.check_trial_is_updatable(trial_id, trial.state)                                                                                                      
  File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_base.py", line 723, in check_trial_is_updatable                                                                                                                                                                            
    "Trial#{} has already finished and can not be updated.".format(trial.number)                                                                              
RuntimeError: Trial#5562 has already finished and can not be updated.              

jmakov commented 2 years ago

This is still present in ray 1.7.1 and nightly. It is a complete blocker: not only is Tune already slow (see https://github.com/ray-project/ray/issues/18903#issuecomment-951504211, 10x slower after a few hours), it also crashes after a few hours on the cluster.

krfricke commented 2 years ago

Sorry for the late reply.

I couldn't reproduce the error with various settings for buffered results. However, we've cleaned up Ray Tune's buffering in the past months, so it might be that the error is resolved by some of those changes.

The core problem here lies in how Optuna handles duplicate results: it raises a `RuntimeError` when a trial that has already finished is updated again. Of course, trial completion shouldn't be called twice in the first place.

https://github.com/ray-project/ray/pull/23495 addresses these problems so that Optuna doesn't crash when it receives such duplicate results.
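
For illustration, here is a minimal sketch of the failure mode and of one possible guard against it. It does not use Optuna or Ray, and it is not the code from the PR above: `StrictStudy` and `GuardedSearcher` are hypothetical stand-ins, and the `"loss"` metric name is an assumption. The idea is only that a searcher can remember which trial IDs it has already completed and ignore a second completion call instead of forwarding it to a study that rejects updates to finished trials.

```python
class StrictStudy:
    """Hypothetical stand-in for a study that rejects updates to finished trials."""

    def __init__(self):
        self.values = {}

    def tell(self, trial_id, value):
        # Mimics the check seen in the traceback: a finished trial cannot be updated.
        if trial_id in self.values:
            raise RuntimeError(
                f"Trial#{trial_id} has already finished and can not be updated."
            )
        self.values[trial_id] = value


class GuardedSearcher:
    """Hypothetical searcher wrapper that ignores repeated completions."""

    def __init__(self, study):
        self._study = study
        self._completed = set()

    def on_trial_complete(self, trial_id, result=None, error=False):
        # A duplicate completion (for example, an error callback fired after
        # the result was already reported) is dropped instead of forwarded.
        if trial_id in self._completed:
            return False
        self._completed.add(trial_id)
        value = None if error or result is None else result.get("loss")
        self._study.tell(trial_id, value)
        return True


# Without the guard, the second call raises, as in the traceback.
unguarded = StrictStudy()
unguarded.tell("5562", 0.25)
try:
    unguarded.tell("5562", None)
    raised = False
except RuntimeError:
    raised = True

# With the guard, the second call is ignored and the first value is kept.
study = StrictStudy()
searcher = GuardedSearcher(study)
first = searcher.on_trial_complete("5562", result={"loss": 0.25})
second = searcher.on_trial_complete("5562", error=True)
```

In this sketch the second call mirrors the chained exception in the traceback, where the failure handler calls `on_trial_complete` with `error=True` for a trial whose result had already been reported.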