ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Unable to connect to a Ray cluster via the client in version 2.7.0 when specifying a working_dir. #40104

Open mortenhoyer opened 1 year ago

mortenhoyer commented 1 year ago

What happened + What you expected to happen

Unable to connect to a Ray cluster via the client in version 2.7.0 when specifying a working_dir. The same script works in 2.6.3.

Versions / Dependencies

ray 2.7.0, Python 3.10 or 3.11 (same result in both), Linux 9


Package                   Version
------------------------- ------------
aiohttp                   3.8.5
aiohttp-cors              0.7.0
aiorwlock                 1.3.0
aiosignal                 1.3.1
annotated-types           0.5.0
anyio                     3.7.1
async-timeout             4.0.3
attrs                     23.1.0
blessed                   1.20.0
boto3                     1.28.57
botocore                  1.31.57
cachetools                5.3.1
catboost                  1.2.2
certifi                   2023.7.22
charset-normalizer        3.3.0
click                     8.1.7
colorful                  0.5.5
contourpy                 1.1.1
cycler                    0.11.0
DateTime                  5.2
distlib                   0.3.7
fastapi                   0.103.2
filelock                  3.12.4
fonttools                 4.42.1
frozenlist                1.4.0
google-api-core           2.12.0
google-auth               2.23.2
googleapis-common-protos  1.60.0
gpustat                   1.1.1
graphviz                  0.20.1
grpcio                    1.59.0
h11                       0.14.0
idna                      3.4
jmespath                  1.0.1
joblib                    1.3.2
jsonschema                4.19.1
jsonschema-specifications 2023.7.1
kiwisolver                1.4.5
llvmlite                  0.41.0
matplotlib                3.8.0
msgpack                   1.0.7
multidict                 6.0.4
numba                     0.58.0
numpy                     1.25.2
nvidia-ml-py              12.535.108
opencensus                0.11.3
opencensus-context        0.1.3
packaging                 23.2
pandas                    2.1.1
patsy                     0.5.3
Pillow                    10.0.1
pip                       23.2.1
platformdirs              3.10.0
plotly                    5.17.0
progressbar2              4.2.0
prometheus-client         0.17.1
protobuf                  4.24.3
psutil                    5.9.5
psycopg2-binary           2.9.8
py-spy                    0.3.14
pyasn1                    0.5.0
pyasn1-modules            0.3.0
pydantic                  1.10.13
pydantic_core             2.10.1
pyparsing                 3.1.1
python-dateutil           2.8.2
python-utils              3.8.1
pytz                      2023.3.post1
PyYAML                    6.0.1
ray                       2.7.0
referencing               0.30.2
requests                  2.31.0
rpds-py                   0.10.3
rsa                       4.9
s3transfer                0.7.0
scikit-learn              1.3.1
scipy                     1.11.3
setuptools                65.5.0
six                       1.16.0
smart-open                6.4.0
sniffio                   1.3.0
sortedcontainers          2.4.0
starlette                 0.27.0
statsmodels               0.14.0
tenacity                  8.2.3
threadpoolctl             3.2.0
typing_extensions         4.8.0
tzdata                    2023.3
urllib3                   1.26.16
uvicorn                   0.23.2
virtualenv                20.21.0
watchfiles                0.20.0
wcwidth                   0.2.8
xgboost                   2.0.0
yarl                      1.9.2
zope.interface            6.0

Reproduction script

Start the head node like this: ray start --head

Run this Python script:

import ray

# This works (no runtime_env):
# ray.init('ray://localhost:10001')

# This does not work (working_dir set in runtime_env):
ray.init('ray://localhost:10001', runtime_env={'working_dir': './'})

Output:

2023-10-04 16:05:15,535 INFO packaging.py:518 -- Creating a file package for local directory '.'.
2023-10-04 16:05:15,536 INFO packaging.py:346 -- Pushing file package 'gcs://_ray_pkg_e66cfe77fbc2e668.zip' (0.00MiB) to Ray cluster...
2023-10-04 16:05:15,537 INFO packaging.py:359 -- Successfully pushed file package 'gcs://_ray_pkg_e66cfe77fbc2e668.zip'.
Traceback (most recent call last):
  File "/data/home/mhoyerusr/bin/test/test.py", line 7, in <module>
    ray.init('ray://localhost:10001', runtime_env={'working_dir': '.'})
  File "/home/mhoyerusr/packages/python-3.11.5/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/mhoyerusr/packages/python-3.11.5/lib/python3.11/site-packages/ray/_private/worker.py", line 1354, in init
    ctx = builder.connect()
          ^^^^^^^^^^^^^^^^^
  File "/home/mhoyerusr/packages/python-3.11.5/lib/python3.11/site-packages/ray/client_builder.py", line 173, in connect
    client_info_dict = ray.util.client_connect.connect(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mhoyerusr/packages/python-3.11.5/lib/python3.11/site-packages/ray/util/client_connect.py", line 55, in connect
    conn = ray.connect(
           ^^^^^^^^^^^^
  File "/home/mhoyerusr/packages/python-3.11.5/lib/python3.11/site-packages/ray/util/client/__init__.py", line 250, in connect
    conn = self.get_context().connect(*args, **kw_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mhoyerusr/packages/python-3.11.5/lib/python3.11/site-packages/ray/util/client/__init__.py", line 100, in connect
    self.client_worker._server_init(job_config, ray_init_kwargs)
  File "/home/mhoyerusr/packages/python-3.11.5/lib/python3.11/site-packages/ray/util/client/worker.py", line 855, in _server_init
    raise ConnectionAbortedError(
ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/home/mhoyerusr/packages/python-3.11.5/lib/python3.11/site-packages/ray/util/client/server/proxier.py", line 704, in Datapath
    if not self.proxy_manager.start_specific_server(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mhoyerusr/packages/python-3.11.5/lib/python3.11/site-packages/ray/util/client/server/proxier.py", line 305, in start_specific_server
    serialized_runtime_env_context = self._create_runtime_env(
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mhoyerusr/packages/python-3.11.5/lib/python3.11/site-packages/ray/util/client/server/proxier.py", line 281, in _create_runtime_env
    raise TimeoutError(
TimeoutError: GetOrCreateRuntimeEnv request failed after 5 attempts. Last exception: HTTP Error 403: Forbidden

Issue Severity

Medium: It is a significant difficulty but I can work around it.

sveint commented 7 months ago

I'm seeing the same error message on 2.10.0, and I'm not sure where to look to debug it.

EDIT: I'm using a proxy for internet access. In my case the cause was that I had not added the local IPs of the head and worker nodes to the no_proxy environment variable when starting Ray in the terminal. The 403 Forbidden came from the proxy when the Ray head node (not the client) tried to connect to the RuntimeEnvAgent using the local IP.
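
For anyone who wants to try the same workaround, here is a minimal sketch of building the no_proxy value. It is not part of Ray: the helper name merge_no_proxy and the IP addresses are hypothetical placeholders, and exporting both the lowercase and uppercase spellings is an assumption, since tools differ in which one they read. The script only prints the export lines. They still have to be run in the terminal that starts the head node, before ray start --head.

```python
import os


def merge_no_proxy(existing: str, hosts: list[str]) -> str:
    """Append hosts to a comma-separated no_proxy value, skipping duplicates."""
    entries = [entry.strip() for entry in existing.split(",") if entry.strip()]
    for host in hosts:
        if host not in entries:
            entries.append(host)
    return ",".join(entries)


# Hypothetical addresses: replace them with the real local IPs of the
# head and worker nodes in your cluster.
cluster_hosts = ["localhost", "127.0.0.1", "10.0.0.5", "10.0.0.6"]

merged = merge_no_proxy(os.environ.get("no_proxy", ""), cluster_hosts)

# Shell lines to run in the terminal before `ray start --head`, so that
# requests from the head node to these addresses bypass the proxy.
print(f"export no_proxy={merged}")
print(f"export NO_PROXY={merged}")
```

Whether this resolves the 403 depends on the proxy actually being the source of the error, as it was in my setup.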

Fresh-Mint commented 5 months ago

I still have this exact same issue on 2.21.0. Changing no_proxy didn't work for me.

anyscalesam commented 5 months ago

Seems like more folks in the community are running into this. @rynewang @jjyao, did we ever root-cause this?