Closed jmakov closed 3 years ago
Using log_to_file=True, trialdir/stdout and trialdir/stderr also aren't present.
Here's a scenario where everything works except the Ray dashboard:
- os.environ["TUNE_PLACEMENT_GROUP_AUTO_DISABLED"] = "1"
- resources_per_trial is not used
- num_samples is <= available CPUs on the cluster: TensorBoard works. If this parameter is higher, Tune hangs and TensorBoard doesn't work.
As discussed with @Yard1 on a call, when commenting out reuse_actors, it works as expected.
After initial success (it ran great for about 15 min), fewer and fewer CPUs on the cluster are used until the cluster has nothing to do and Tune hangs. ray monitor shows that all of the cluster's CPUs are in use. After another run (after cluster down/stop/up), it again starts processes on the whole cluster but hangs not after 15 min but after a couple of seconds. The same behavior occurs if I leave out the search_alg argument.
Interesting, I'm wondering if this is actually due to the PB2 scheduler.
Can you share a full reproducible script (including a (fake) trainable, i.e. get_signal and OM_process_tune) and maybe the cluster config? Is this running on AWS?
Which Ray version are you running?
@krfricke We have removed the scheduler and it didn't impact anything. The Ray version is 1.6. @jmakov can share more information
@krfricke It's a local cluster started with ray up --no-config-cache cluster.yaml. I don't currently have access to AWS or Google Cloud. What currently works is using ConcurrencyLimiter and commenting out reuse_actors=True.
cluster.yaml:
cluster_name: default
provider:
  type: local
  head_ip: 192.168.0.101
  worker_ips:
  - 192.168.0.100
  - 192.168.0.102
auth:
  ssh_user: toaster
min_workers: 2
max_workers: 2
upscaling_speed: 1.0
idle_timeout_minutes: 5
file_mounts: {
  "~/workspace_ray_cluster": "~/workspace/puma/src/puma_lab",
}
cluster_synced_files: []
file_mounts_sync_continuously: False
rsync_exclude:
- "**/.git"
- "**/.git/**"
rsync_filter:
- ".gitignore"
initialization_commands: []
setup_commands:
- conda env create -q -n puma-lab -f ~/workspace_ray_cluster/environment.yaml || conda env update -q -n puma-lab -f ~/workspace_ray_cluster/environment.yaml
head_setup_commands: []
worker_setup_commands: []
head_start_ray_commands:
- conda activate puma-lab && ray stop
- conda activate puma-lab && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
- conda activate puma-lab && ray stop
- conda activate puma-lab && ray start --address=$RAY_HEAD_IP:6379
environment.yaml:
name: puma-lab
channels:
- pyviz
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=1_gnu
- abseil-cpp=20210324.2=h9c3ff4c_0
- alembic=1.7.3=pyhd8ed1ab_0
- alsa-lib=1.2.3=h516909a_0
- anyio=3.3.0=py37h89c1867_0
- argcomplete=1.12.3=pyhd8ed1ab_2
- argon2-cffi=20.1.0=py37h5e8e339_2
- arrow-cpp=5.0.0=py37hdf48254_5_cpu
- async_generator=1.10=py_0
- attrs=21.2.0=pyhd8ed1ab_0
- autopage=0.4.0=pyhd8ed1ab_0
- aws-c-cal=0.5.11=h95a6274_0
- aws-c-common=0.6.2=h7f98852_0
- aws-c-event-stream=0.2.7=h3541f99_13
- aws-c-io=0.10.5=hfb6a706_0
- aws-checksums=0.1.11=ha31a3da_7
- aws-sdk-cpp=1.8.186=hb4091e7_3
- babel=2.9.1=pyh44b312d_0
- backcall=0.2.0=pyh9f0ad1d_0
- backports=1.0=py_2
- backports.functools_lru_cache=1.6.4=pyhd8ed1ab_0
- backports.zoneinfo=0.2.1=py37h5e8e339_4
- bleach=4.1.0=pyhd8ed1ab_0
- bokeh=2.3.3=py37h89c1867_0
- brotlipy=0.7.0=py37h5e8e339_1001
- bzip2=1.0.8=h7f98852_4
- c-ares=1.17.2=h7f98852_0
- ca-certificates=2021.5.30=ha878542_0
- certifi=2021.5.30=py37h89c1867_0
- cffi=1.14.6=py37hc58025e_0
- chardet=4.0.0=py37h89c1867_1
- charset-normalizer=2.0.0=pyhd8ed1ab_0
- click=8.0.1=py37h89c1867_0
- clickhouse-cityhash=1.0.2.3=py37h3340039_2
- clickhouse-driver=0.2.1=py37h5e8e339_0
- cliff=3.9.0=pyhd8ed1ab_0
- cloudpickle=2.0.0=pyhd8ed1ab_0
- cmaes=0.8.2=pyh44b312d_0
- cmd2=2.2.0=py37h89c1867_0
- colorama=0.4.4=pyh9f0ad1d_0
- colorcet=2.0.6=pyhd8ed1ab_0
- colorlog=6.4.1=py37h89c1867_0
- conda=4.10.3=py37h89c1867_1
- conda-package-handling=1.7.3=py37h5e8e339_0
- cramjam=2.3.1=py37h5e8e339_1
- cryptography=3.4.7=py37h5d9358c_0
- cycler=0.10.0=py_2
- cytoolz=0.11.0=py37h5e8e339_3
- dask=2021.9.0=pyhd8ed1ab_0
- dask-core=2021.9.0=pyhd8ed1ab_0
- datashader=0.13.0=pyh6c4a22f_0
- datashape=0.5.4=py_1
- dbus=1.13.6=h48d8840_2
- debugpy=1.4.1=py37hcd2ae1e_0
- decorator=5.1.0=pyhd8ed1ab_0
- defusedxml=0.7.1=pyhd8ed1ab_0
- distributed=2021.9.0=py37h89c1867_0
- entrypoints=0.3=py37hc8dfbb8_1002
- expat=2.4.1=h9c3ff4c_0
- fastparquet=0.7.1=py37hb1e94ed_0
- filelock=3.0.12=pyh9f0ad1d_0
- fontconfig=2.13.1=hba837de_1005
- freetype=2.10.4=h0708190_1
- fsspec=2021.8.1=pyhd8ed1ab_0
- gettext=0.19.8.1=h0b5b191_1005
- gflags=2.2.2=he1b5a44_1004
- gitdb=4.0.7=pyhd8ed1ab_0
- gitpython=3.1.23=pyhd8ed1ab_1
- glib=2.68.4=h9c3ff4c_0
- glib-tools=2.68.4=h9c3ff4c_0
- glog=0.5.0=h48cff8f_0
- greenlet=1.1.1=py37hcd2ae1e_0
- grpc-cpp=1.40.0=h850795e_0
- gst-plugins-base=1.18.5=hf529b03_0
- gstreamer=1.18.5=h76c114f_0
- heapdict=1.0.1=py_0
- holoviews=1.14.5=py_0
- hvplot=0.7.3=py_0
- icu=68.1=h58526e2_0
- idna=3.1=pyhd3deb0d_0
- importlib-metadata=4.8.1=py37h89c1867_0
- importlib_metadata=4.8.1=hd8ed1ab_0
- importlib_resources=5.2.2=pyhd8ed1ab_0
- ipykernel=6.4.1=py37h6531663_0
- ipympl=0.7.0=pyhd8ed1ab_0
- ipython=7.27.0=py37h6531663_0
- ipython_genutils=0.2.0=py_1
- ipywidgets=7.6.5=pyhd8ed1ab_0
- jbig=2.1=h7f98852_2003
- jedi=0.18.0=py37h89c1867_2
- jinja2=3.0.1=pyhd8ed1ab_0
- joblib=1.0.1=pyhd8ed1ab_0
- jpeg=9d=h36c2ea0_0
- json5=0.9.5=pyh9f0ad1d_0
- jsonschema=3.2.0=py37hc8dfbb8_1
- jupyter-server-mathjax=0.2.3=pyhd8ed1ab_0
- jupyter_client=7.0.2=pyhd8ed1ab_0
- jupyter_contrib_core=0.3.3=py_2
- jupyter_contrib_nbextensions=0.5.1=py37hc8dfbb8_1
- jupyter_core=4.7.1=py37h89c1867_0
- jupyter_highlight_selected_word=0.2.0=py37h89c1867_1002
- jupyter_latex_envs=1.4.6=py37h89c1867_1001
- jupyter_nbextensions_configurator=0.4.1=py37h89c1867_2
- jupyter_server=1.11.0=pyhd8ed1ab_0
- jupyterlab=3.1.11=pyhd8ed1ab_0
- jupyterlab-git=0.32.2=pyhd8ed1ab_0
- jupyterlab_pygments=0.1.2=pyh9f0ad1d_0
- jupyterlab_server=2.8.1=pyhd8ed1ab_0
- jupyterlab_widgets=1.0.2=pyhd8ed1ab_0
- kiwisolver=1.3.2=py37h2527ec5_0
- krb5=1.19.2=hcc1bbae_0
- lcms2=2.12=hddcbb42_0
- ld_impl_linux-64=2.36.1=hea4e1c9_2
- lerc=2.2.1=h9c3ff4c_0
- libarchive=3.5.2=hccf745f_0
- libblas=3.9.0=11_linux64_openblas
- libbrotlicommon=1.0.9=h7f98852_5
- libbrotlidec=1.0.9=h7f98852_5
- libbrotlienc=1.0.9=h7f98852_5
- libcblas=3.9.0=11_linux64_openblas
- libclang=11.1.0=default_ha53f305_1
- libcurl=7.78.0=h2574ce0_0
- libdeflate=1.7=h7f98852_5
- libedit=3.1.20191231=he28a2e2_2
- libev=4.33=h516909a_1
- libevent=2.1.10=hcdb4288_3
- libffi=3.3=h58526e2_2
- libgcc-ng=11.1.0=hc902ee8_8
- libgfortran-ng=11.1.0=h69a702a_8
- libgfortran5=11.1.0=h6c583b3_8
- libglib=2.68.4=h3e27bee_0
- libgomp=11.1.0=hc902ee8_8
- libiconv=1.16=h516909a_0
- liblapack=3.9.0=11_linux64_openblas
- libllvm11=11.1.0=hf817b99_2
- libnghttp2=1.43.0=h812cca2_0
- libogg=1.3.4=h7f98852_1
- libopenblas=0.3.17=pthreads_h8fe5266_1
- libopus=1.3.1=h7f98852_1
- libpng=1.6.37=h21135ba_2
- libpq=13.3=hd57d9b9_0
- libprotobuf=3.16.0=h780b84a_0
- libsodium=1.0.18=h36c2ea0_1
- libsolv=0.7.19=h780b84a_5
- libssh2=1.10.0=ha56f1ee_0
- libstdcxx-ng=11.1.0=h56837e0_8
- libta-lib=0.4.0=h516909a_0
- libthrift=0.14.2=he6d91bd_1
- libtiff=4.3.0=hf544144_1
- libutf8proc=2.6.1=h7f98852_0
- libuuid=2.32.1=h7f98852_1000
- libuv=1.42.0=h7f98852_0
- libvorbis=1.3.7=h9c3ff4c_0
- libwebp-base=1.2.1=h7f98852_0
- libxcb=1.13=h7f98852_1003
- libxkbcommon=1.0.3=he3ba5ed_0
- libxml2=2.9.12=h72842e0_0
- libxslt=1.1.33=h15afd5d_2
- llvmlite=0.37.0=py37h9d7f4d0_0
- locket=0.2.0=py_2
- lxml=4.6.3=py37h77fd288_0
- lz4-c=1.9.3=h9c3ff4c_1
- lzo=2.10=h516909a_1000
- mako=1.1.5=pyhd8ed1ab_0
- mamba=0.15.3=py37h7f483ca_0
- markdown=3.3.4=pyhd8ed1ab_0
- markupsafe=2.0.1=py37h5e8e339_0
- matplotlib=3.4.3=py37h89c1867_0
- matplotlib-base=3.4.3=py37h1058ff1_0
- matplotlib-inline=0.1.3=pyhd8ed1ab_0
- mistune=0.8.4=py37h5e8e339_1004
- modin-core=0.10.2=py37h89c1867_1
- modin-ray=0.10.2=py37h89c1867_1
- msgpack-python=1.0.2=py37h2527ec5_1
- multipledispatch=0.6.0=py_0
- mysql-common=8.0.25=ha770c72_2
- mysql-libs=8.0.25=hfa10184_2
- nb_conda_kernels=2.3.1=py37h89c1867_0
- nbclassic=0.3.1=pyhd8ed1ab_1
- nbclient=0.5.4=pyhd8ed1ab_0
- nbconvert=6.1.0=py37h89c1867_0
- nbdime=3.1.0=pyhd8ed1ab_0
- nbformat=5.1.3=pyhd8ed1ab_0
- ncurses=6.2=h58526e2_4
- nest-asyncio=1.5.1=pyhd8ed1ab_0
- nodejs=16.6.1=h92b4a50_0
- notebook=6.4.3=pyha770c72_0
- nspr=4.30=h9c3ff4c_0
- nss=3.69=hb5efdd6_0
- numba=0.54.0=py37h2d894fd_0
- numpy=1.20.3=py37h038b26d_1
- olefile=0.46=pyh9f0ad1d_1
- openjpeg=2.4.0=hb52868f_1
- openssl=1.1.1l=h7f98852_0
- optuna=2.9.1=pyhd8ed1ab_0
- orc=1.6.10=h58a87f1_0
- packaging=21.0=pyhd8ed1ab_0
- pandas=1.3.2=py37he8f5f7f_0
- pandoc=2.14.2=h7f98852_0
- pandocfilters=1.4.2=py_1
- panel=0.12.1=py_0
- param=1.11.1=pyh6c4a22f_0
- parquet-cpp=1.5.1=1
- parso=0.8.2=pyhd8ed1ab_0
- partd=1.2.0=pyhd8ed1ab_0
- pbr=5.6.0=pyhd8ed1ab_0
- pcre=8.45=h9c3ff4c_0
- pexpect=4.8.0=py37hc8dfbb8_1
- pickle5=0.0.11=py37h5e8e339_0
- pickleshare=0.7.5=py37hc8dfbb8_1002
- pillow=8.3.2=py37h0f21c89_0
- pip=21.2.4=pyhd8ed1ab_0
- prettytable=2.2.0=pyhd8ed1ab_0
- prometheus_client=0.11.0=pyhd8ed1ab_0
- prompt-toolkit=3.0.20=pyha770c72_0
- psutil=5.8.0=py37h5e8e339_1
- pthread-stubs=0.4=h36c2ea0_1001
- ptyprocess=0.7.0=pyhd3deb0d_0
- pyarrow=5.0.0=py37h58331f5_5_cpu
- pycosat=0.6.3=py37h5e8e339_1006
- pycparser=2.20=pyh9f0ad1d_2
- pyct=0.4.6=py_0
- pyct-core=0.4.6=py_0
- pygments=2.10.0=pyhd8ed1ab_0
- pykalman=0.9.5=py_1
- pyopenssl=20.0.1=pyhd8ed1ab_0
- pyparsing=2.4.7=pyh9f0ad1d_0
- pyperclip=1.8.2=pyhd8ed1ab_2
- pyqt=5.12.3=py37h89c1867_7
- pyqt-impl=5.12.3=py37he336c9b_7
- pyqt5-sip=4.19.18=py37hcd2ae1e_7
- pyqtchart=5.12=py37he336c9b_7
- pyqtwebengine=5.12.1=py37he336c9b_7
- pyrsistent=0.17.3=py37h5e8e339_2
- pysocks=1.7.1=py37h89c1867_3
- python=3.7.10=hffdb5ce_100_cpython
- python-dateutil=2.8.2=pyhd8ed1ab_0
- python_abi=3.7=2_cp37m
- pytz=2021.1=pyhd8ed1ab_0
- pyviz_comms=2.1.0=py_0
- pyyaml=5.4.1=py37h5e8e339_1
- pyzmq=22.2.1=py37h336d617_0
- qt=5.12.9=hda022c4_4
- ray-core=1.6.0=py37hf931bba_0
- ray-tune=1.6.0=py37h89c1867_0
- re2=2021.09.01=h9c3ff4c_0
- readline=8.1=h46c0cb4_0
- redis-py=3.5.3=pyh9f0ad1d_0
- reproc=14.2.3=h7f98852_0
- reproc-cpp=14.2.3=h9c3ff4c_0
- requests=2.26.0=pyhd8ed1ab_0
- requests-unixsocket=0.2.0=py_0
- ruamel_yaml=0.15.80=py37h5e8e339_1004
- s2n=1.0.10=h9b69904_0
- scikit-learn=0.24.2=py37hf0f1638_1
- send2trash=1.8.0=pyhd8ed1ab_0
- setproctitle=1.1.10=py37h5e8e339_1004
- setuptools=58.0.4=py37h89c1867_0
- six=1.16.0=pyh6c4a22f_0
- smmap=3.0.5=pyh44b312d_0
- snappy=1.1.8=he1b5a44_3
- sniffio=1.2.0=py37h89c1867_1
- sortedcontainers=2.4.0=pyhd8ed1ab_0
- sqlalchemy=1.4.25=py37h5e8e339_0
- sqlite=3.36.0=h9cd32fc_1
- stevedore=3.4.0=py37h89c1867_0
- ta-lib=0.4.19=py37ha21ca33_2
- tabulate=0.8.9=pyhd8ed1ab_0
- tblib=1.7.0=pyhd8ed1ab_0
- tensorboardx=2.4=pyhd8ed1ab_0
- terminado=0.12.1=py37h89c1867_0
- testpath=0.5.0=pyhd8ed1ab_0
- threadpoolctl=2.2.0=pyh8a188c0_0
- thrift=0.13.0=py37hcd2ae1e_2
- tk=8.6.11=h27826a3_1
- toolz=0.11.1=py_0
- tornado=6.1=py37h5e8e339_1
- tqdm=4.62.2=pyhd8ed1ab_0
- traitlets=5.1.0=pyhd8ed1ab_0
- typing_extensions=3.10.0.0=pyha770c72_0
- tzdata=2021a=he74cb21_1
- tzlocal=3.0=py37h89c1867_2
- urllib3=1.26.6=pyhd8ed1ab_0
- wcwidth=0.2.5=pyh9f0ad1d_2
- webencodings=0.5.1=py_1
- websocket-client=0.57.0=py37h89c1867_4
- wheel=0.37.0=pyhd8ed1ab_1
- widgetsnbextension=3.5.1=py37h89c1867_4
- xarray=0.19.0=pyhd8ed1ab_1
- xeus=2.0.0=h7d0c39e_0
- xeus-python=0.13.0=py37h4b46df4_1
- xeus-python-shell=0.1.5=pyhd8ed1ab_0
- xorg-libxau=1.0.9=h7f98852_0
- xorg-libxdmcp=1.1.3=h7f98852_0
- xz=5.2.5=h516909a_1
- yaml=0.2.5=h516909a_0
- zeromq=4.3.4=h9c3ff4c_1
- zict=2.0.0=py_0
- zipp=3.5.0=pyhd8ed1ab_0
- zlib=1.2.11=h516909a_1010
- zstandard=0.15.2=py37h5e8e339_0
- zstd=1.5.0=ha95c52a_0
- pip:
  - absl-py==0.13.0
  - aiohttp==3.7.4.post0
  - aiohttp-cors==0.7.0
  - aioredis==1.3.1
  - async-timeout==3.0.1
  - autograd==1.3
  - bayesian-optimization==1.2.0
  - blessings==1.7
  - cachetools==4.2.2
  - cma==2.7.0
  - colorful==0.5.4
  - cython==0.29.24
  - future==0.18.2
  - google-api-core==1.31.2
  - google-auth==1.35.0
  - google-auth-oauthlib==0.4.6
  - googleapis-common-protos==1.53.0
  - gpustat==0.6.0
  - gpy==1.10.0
  - gpytorch==1.5.1
  - grpcio==1.40.0
  - hebo==0.1.0
  - hiredis==2.0.0
  - multidict==5.1.0
  - nevergrad==0.4.3.post8
  - nvidia-ml-py3==7.352.0
  - oauthlib==3.1.1
  - opencensus==0.7.13
  - opencensus-context==0.1.2
  - paramz==0.9.5
  - protobuf==3.17.3
  - py-spy==0.3.9
  - pyasn1==0.4.8
  - pyasn1-modules==0.2.8
  - pymoo==0.4.2.2
  - ray==1.6.0
  - requests-oauthlib==1.3.0
  - rsa==4.7.2
  - scipy==1.5.4
  - sklearn==0.0
  - tensorboard==2.6.0
  - tensorboard-data-server==0.6.1
  - tensorboard-plugin-wit==1.8.0
  - torch==1.9.1
  - werkzeug==2.0.1
  - yarl==1.6.3
@krfricke @Yard1 I think I found at least part of the problem. This example hangs with or without reuse_actors. Looks like a modin-ray interoperability issue:
import modin.pandas as pd
import ray
from ray import tune
from ray.tune.suggest.basic_variant import BasicVariantGenerator

ray.init(address='auto', _redis_password='xxx')

# NOTE: `df` is not defined in this snippet; it is assumed to be a DataFrame
# (with columns test, bid, ask, decimals_price) created beforehand.

def easy_objective(config, data):
    data_df = data[0]
    # Here be dragons. If either of the below lines are included, Tune hangs.
    score = int(pd.DataFrame(pd.Series(df.test), columns=["test"]).explode(["test"]).test.sum())
    # pd.DataFrame(pd.Series(df.test), columns=["test"]).explode(["test"])
    # pd.DataFrame(pd.Series(df.test), columns=["test"]).sum()
    tune.report(score=score)

tune.run(
    tune.with_parameters(easy_objective, data=[df.index.values, df.bid.values, df.ask.values, df.decimals_price[0]]),
    name="test_study",
    time_budget_s=3600*24*3,
    num_samples=-1,
    verbose=3,
    fail_fast=True,
    config={
        "steps": 100,
        "width": tune.uniform(0, 20),
        "height": tune.uniform(-100, 100),
        "activation": tune.grid_search(["relu", "tanh"])
    },
    metric="score",
    mode="max",
    # but works with this enabled
    # search_alg=BasicVariantGenerator(max_concurrent=CLUSTER_AVAILABLE_LOGICAL_CPUS - 1), #N.B. "-1", else hangs
)
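For reference, the commented-out search_alg workaround above could be wired up roughly like this. This is only a sketch: max_concurrent_trials and build_search_alg are hypothetical helper names, the one-CPU reserve mirrors the "- 1" in the comment, and the import path is the Ray 1.6 one already used in the script.

```python
def max_concurrent_trials(cluster_cpus: int, reserved_cpus: int = 1) -> int:
    """Number of trials to run at once, leaving `reserved_cpus` CPUs free."""
    if cluster_cpus < 1:
        raise ValueError("cluster_cpus must be at least 1")
    # Always allow at least one trial, even on a tiny cluster.
    return max(1, cluster_cpus - reserved_cpus)


def build_search_alg(reserved_cpus: int = 1):
    """Build a concurrency-capped BasicVariantGenerator.

    Hypothetical helper; assumes ray.init() has already connected to the cluster.
    """
    # Imported lazily so max_concurrent_trials() can be used without Ray installed.
    import ray
    from ray.tune.suggest.basic_variant import BasicVariantGenerator  # Ray 1.6 path

    cluster_cpus = int(ray.cluster_resources().get("CPU", 0))
    return BasicVariantGenerator(
        max_concurrent=max_concurrent_trials(cluster_cpus, reserved_cpus)
    )
```

The result would then be passed as search_alg=build_search_alg() to tune.run(). How many CPUs need to stay free probably depends on what the trainable does internally, so the reserve is left as a parameter.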
I also vote we rename the project from ray to dragons_everywhere :P.
Yeah, looks like Tune is taking up all CPU resources, causing modin operations inside the trainable to deadlock. This is also why limiting concurrency fixes the issue, as it frees up enough CPUs for modin to work.
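To make the explanation above concrete, here is a toy model of the situation. It is not Ray's scheduler, just a sketch under the assumption that every running trial holds one CPU and needs one more free CPU for its nested (modin) work before it can finish; run_trials is a made-up name.

```python
def run_trials(total_cpus: int, max_concurrent: int, num_trials: int) -> int:
    """Return how many trials finish in the toy model."""
    pending = num_trials
    finished = 0
    while pending > 0:
        # Start as many trials as the concurrency cap and the CPU count allow.
        running = min(pending, max_concurrent, total_cpus)
        free = total_cpus - running
        if free == 0:
            # Every CPU is held by a trial that is itself waiting for a CPU
            # for its nested task, so nothing can ever finish: deadlock.
            return finished
        # With at least one free CPU the nested tasks can run one after
        # another, so every running trial eventually completes.
        finished += running
        pending -= running
    return finished


print(run_trials(total_cpus=4, max_concurrent=4, num_trials=10))
print(run_trials(total_cpus=4, max_concurrent=3, num_trials=10))
```

In this model the uncapped run finishes no trials, while capping concurrency one below the CPU count lets all ten complete, which is consistent with the "- 1" workaround.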
Opened an issue on the modin project: https://github.com/modin-project/modin/issues/3479. Not sure when they will respond, but if it takes a couple of days perhaps we can just update the Ray docs?
Upon discussing the issue further with @Yard1, what also works is resources_per_trial={"cpu":0,"extra_cpu":1}. In this case, though, ray monitor reports 3 to 5 CPUs in use in the whole cluster (52 available) while almost all CPUs are working. Another observation with the resources_per_trial solution: node1 has 12 CPUs and a load of 20, node2 has 8 CPUs and a load of 20, node3 (head node) has 32 CPUs and a load of 18. Is the head node intentionally underutilized or is ray just equally distributing workload among nodes?
Using ConcurrencyLimiter(..., max_concurrent=AVAIL_CPUS_ON_CLUSTER - 2), the workload is better handled: node1: 13, node2: 8, node3: 31.
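The two sets of numbers are easier to compare as load per CPU. A quick calculation, assuming the reported loads are comparable per-node load figures (the dictionary and function names are just for illustration):

```python
# CPU counts and loads as reported above.
CPUS = {"node1": 12, "node2": 8, "node3 (head)": 32}
LOAD_RESOURCES_PER_TRIAL = {"node1": 20, "node2": 20, "node3 (head)": 18}
LOAD_CONCURRENCY_LIMITER = {"node1": 13, "node2": 8, "node3 (head)": 31}


def load_per_cpu(load: dict, cpus: dict) -> dict:
    """Divide each node's load by its CPU count."""
    return {name: load[name] / cpus[name] for name in cpus}


for label, load in [
    ("resources_per_trial", LOAD_RESOURCES_PER_TRIAL),
    ("ConcurrencyLimiter", LOAD_CONCURRENCY_LIMITER),
]:
    ratios = load_per_cpu(load, CPUS)
    print(label, {name: f"{ratio:.2f}" for name, ratio in ratios.items()})
```

With resources_per_trial the two workers come out oversubscribed (about 1.67 and 2.50 per CPU) while the head node sits at roughly 0.56. With ConcurrencyLimiter all three nodes are close to 1.0, so the load there roughly tracks each node's CPU count.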
Search before asking
Ray Component
Ray Tune
What happened + What you expected to happen
Update:
After investigating, it appears that reuse_actors=True is the culprit, causing a cluster to hang with unfulfilled resource requirements. Setting it to False solves the issue. Looks like an issue between modin and ray.
tune.run() starts work on a local cluster. After a couple of minutes, fewer and fewer CPUs are used. After no CPU is utilized, tune.run() still hasn't finished. The expected behavior is that after tune.run() is called, all cluster resources are utilized until tune.run() finishes.
Additional info:
ray monitor cluster.yaml shows that all CPUs are in use.
The same behavior occurs with/without:
tune.run(...) output:
Reproduction script
Anything else
- ray up cluster.yaml
- time_budget_s is not respected
- TensorBoard doesn't work (No dashboards are active for the current data set.) whereas ~/ray_results/ is not empty
Are you willing to submit a PR?