ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Failed to download runtime_env file package gcs://_xxx.zip from the GCS to the Ray worker node. #36703

Open wjzhou-ep opened 1 year ago

wjzhou-ep commented 1 year ago

What happened + What you expected to happen

  1. Sometimes the GCS deletes the Ray runtime env package. After using Ray from a notebook for a while, we get an error message like: Failed to download runtime_env file package gcs://_ray_pkg_xxxx.zip.

The workaround is to change any file in the folder and re-run the remote function.

  2. Expected: not seeing this error.

  3. Info: It seems that, somehow, the client thinks the GCS has the runtime env URI while the GCS doesn't have it. Changing a file locally changes the hash, so the client uploads the package again.
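To illustrate why editing a file works around the error, here is a minimal sketch of content-addressed package naming. It is not Ray's actual hashing code: `directory_digest` and `package_uri` are hypothetical helpers, and the only thing borrowed from the logs is the `gcs://_ray_pkg_<hash>.zip` shape of the URI. The point is that an unchanged folder always maps to the same URI (so a client may skip the upload if it believes the package already exists), while any content change yields a new URI.

```python
import hashlib
import tempfile
from pathlib import Path


def directory_digest(root: Path) -> str:
    """Hash relative paths and file contents in a stable (sorted) order."""
    digest = hashlib.sha1()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()[:16]


def package_uri(root: Path) -> str:
    # Hypothetical naming that only mimics the URIs seen in the log.
    return f"gcs://_ray_pkg_{directory_digest(root)}.zip"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "main.py").write_text("print('hello')\n")

    uri_before = package_uri(root)
    uri_unchanged = package_uri(root)  # same content, so same URI

    # Changing the content of any file produces a different URI.
    (root / "main.py").write_text("print('hello')  # edited\n")
    uri_after = package_uri(root)

print("unchanged folder keeps its URI:", uri_before == uri_unchanged)
print("edited folder gets a new URI:", uri_before != uri_after)
```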

Our env is about 1 MB, so I think the default 10-minute expiration should be long enough to upload it.
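For anyone who wants to try the suggestion from the error message anyway, this is a small sketch of raising the expiration. The variable name and its default of 600 seconds come from the error text; `set_temporary_reference_expiration` is a hypothetical helper, and it is an assumption that the variable has to be set before `ray.init()` in the process that uploads the package. Which process that is (driver, Ray Client server, or head node) is not established in this thread.

```python
import os

# Name taken from the error message in this issue; the default is 600 seconds.
ENV_VAR = "RAY_RUNTIME_ENV_TEMPORARY_REFERENCE_EXPIRATION_S"


def set_temporary_reference_expiration(seconds: int) -> str:
    """Set the expiration in this process's environment and return the stored value."""
    if seconds <= 0:
        raise ValueError("expiration must be a positive number of seconds")
    os.environ[ENV_VAR] = str(seconds)
    return os.environ[ENV_VAR]


# Assumption: run this before ray.init() in the process that uploads the package.
value = set_temporary_reference_expiration(1800)
print(f"{ENV_VAR}={value}")
```

With a package of about 1 MB the upload time is unlikely to be the cause, so this is more of a way to rule the timeout out than a fix.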

Also, not sure if it is related, but I see these lines in runtime_env_agent.log:

2023-06-21 20:03:30,070 INFO runtime_env_agent.py:339 -- Creating runtime env: {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"} with timeout 600 seconds.
2023-06-21 20:03:30,071 INFO runtime_env_agent.py:497 -- Got request from raylet to decrease reference for runtime env: {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"}.
2023-06-21 20:03:30,071 INFO runtime_env_agent.py:130 -- Unused runtime env {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"}.
2023-06-21 20:03:30,071 INFO runtime_env_agent.py:111 -- Unused uris [('gcs://_ray_pkg_07d964c2ad903c39.zip', 'working_dir')].
2023-06-21 20:03:30,082 ERROR runtime_env_agent.py:365 -- Failed to create runtime env {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"}.

full log:

2023-06-21 20:03:30,068 INFO runtime_env_agent.py:497 -- Got request from raylet to decrease reference for runtime env: {"working_dir": "gcs://_ray_pkg_62304e6af9aab441.zip"}.
2023-06-21 20:03:30,068 WARNING runtime_env_agent.py:128 -- Runtime env {"working_dir": "gcs://_ray_pkg_62304e6af9aab441.zip"} does not exist.
2023-06-21 20:03:30,068 WARNING runtime_env_agent.py:109 -- URI gcs://_ray_pkg_62304e6af9aab441.zip does not exist.
2023-06-21 20:03:30,069 INFO runtime_env_agent.py:339 -- Creating runtime env: {"working_dir": "gcs://_ray_pkg_62304e6af9aab441.zip"} with timeout 600 seconds.
2023-06-21 20:03:30,070 INFO runtime_env_agent.py:497 -- Got request from raylet to decrease reference for runtime env: {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"}.
2023-06-21 20:03:30,070 WARNING runtime_env_agent.py:128 -- Runtime env {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"} does not exist.
2023-06-21 20:03:30,070 WARNING runtime_env_agent.py:109 -- URI gcs://_ray_pkg_07d964c2ad903c39.zip does not exist.
2023-06-21 20:03:30,070 INFO runtime_env_agent.py:339 -- Creating runtime env: {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"} with timeout 600 seconds.
2023-06-21 20:03:30,071 INFO runtime_env_agent.py:497 -- Got request from raylet to decrease reference for runtime env: {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"}.
2023-06-21 20:03:30,071 INFO runtime_env_agent.py:130 -- Unused runtime env {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"}.
2023-06-21 20:03:30,071 INFO runtime_env_agent.py:111 -- Unused uris [('gcs://_ray_pkg_07d964c2ad903c39.zip', 'working_dir')].
2023-06-21 20:03:30,082 ERROR runtime_env_agent.py:365 -- Failed to create runtime env {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"}.
Traceback (most recent call last):
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 357, in _create_runtime_env_with_retry
    runtime_env_context = await asyncio.wait_for(
  File "/opt/conda/envs/artemis/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 312, in _setup_runtime_env
    await create_for_plugin_if_needed(
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/_private/runtime_env/plugin.py", line 252, in create_for_plugin_if_needed
    size_bytes = await plugin.create(uri, runtime_env, context, logger=logger)
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/_private/runtime_env/working_dir.py", line 155, in create
    local_dir = await download_and_unpack_package(
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/_private/runtime_env/packaging.py", line 655, in download_and_unpack_package
    raise IOError(
OSError: Failed to download runtime_env file package gcs://_ray_pkg_07d964c2ad903c39.zip from the GCS to the Ray worker node. The package may have prematurely been deleted from the GCS due to a long upload time or a problem with Ray. Try setting the environment variable RAY_RUNTIME_ENV_TEMPORARY_REFERENCE_EXPIRATION_S  to a value larger than the upload time in seconds (the default is 600). If this fails, try re-running after making any change to a file in the file package.
2023-06-21 20:03:30,110 INFO runtime_env_agent.py:390 -- Successfully created runtime env: {"working_dir": "gcs://_ray_pkg_62304e6af9aab441.zip"}, the context: {"command_prefix": ["cd", "/tmp/ray/session_2023-06-21_20-03-27_995203_19/runtime_resources/working_dir_files/_ray_pkg_62304e6af9aab441", "&&"], "env_vars": {"PYTHONPATH": "/tmp/ray/session_2023-06-21_20-03-27_995203_19/runtime_resources/working_dir_files/_ray_pkg_62304e6af9aab441"}, "py_executable": "/opt/conda/envs/artemis/bin/python", "resources_dir": null, "container": {}, "java_jars": []}
2023-06-21 20:03:30,110 INFO runtime_env_agent.py:426 -- Runtime env already created successfully. Env: {"working_dir": "gcs://_ray_pkg_62304e6af9aab441.zip"}, context: {"command_prefix": ["cd", "/tmp/ray/session_2023-06-21_20-03-27_995203_19/runtime_resources/working_dir_files/_ray_pkg_62304e6af9aab441", "&&"], "env_vars": {"PYTHONPATH": "/tmp/ray/session_2023-06-21_20-03-27_995203_19/runtime_resources/working_dir_files/_ray_pkg_62304e6af9aab441"}, "py_executable": "/opt/conda/envs/artemis/bin/python", "resources_dir": null, "container": {}, "java_jars": []}
2023-06-21 20:03:31,086 ERROR runtime_env_agent.py:365 -- Failed to create runtime env {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"}.
Traceback (most recent call last):
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 357, in _create_runtime_env_with_retry
    runtime_env_context = await asyncio.wait_for(
  File "/opt/conda/envs/artemis/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 312, in _setup_runtime_env
    await create_for_plugin_if_needed(
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/_private/runtime_env/plugin.py", line 252, in create_for_plugin_if_needed
    size_bytes = await plugin.create(uri, runtime_env, context, logger=logger)
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/_private/runtime_env/working_dir.py", line 155, in create
    local_dir = await download_and_unpack_package(
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/_private/runtime_env/packaging.py", line 655, in download_and_unpack_package
    raise IOError(
OSError: Failed to download runtime_env file package gcs://_ray_pkg_07d964c2ad903c39.zip from the GCS to the Ray worker node. The package may have prematurely been deleted from the GCS due to a long upload time or a problem with Ray. Try setting the environment variable RAY_RUNTIME_ENV_TEMPORARY_REFERENCE_EXPIRATION_S  to a value larger than the upload time in seconds (the default is 600). If this fails, try re-running after making any change to a file in the file package.
2023-06-21 20:03:32,090 ERROR runtime_env_agent.py:365 -- Failed to create runtime env {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"}.
Traceback (most recent call last):
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 357, in _create_runtime_env_with_retry
    runtime_env_context = await asyncio.wait_for(
  File "/opt/conda/envs/artemis/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 312, in _setup_runtime_env
    await create_for_plugin_if_needed(
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/_private/runtime_env/plugin.py", line 252, in create_for_plugin_if_needed
    size_bytes = await plugin.create(uri, runtime_env, context, logger=logger)
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/_private/runtime_env/working_dir.py", line 155, in create
    local_dir = await download_and_unpack_package(
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/_private/runtime_env/packaging.py", line 655, in download_and_unpack_package
    raise IOError(
OSError: Failed to download runtime_env file package gcs://_ray_pkg_07d964c2ad903c39.zip from the GCS to the Ray worker node. The package may have prematurely been deleted from the GCS due to a long upload time or a problem with Ray. Try setting the environment variable RAY_RUNTIME_ENV_TEMPORARY_REFERENCE_EXPIRATION_S  to a value larger than the upload time in seconds (the default is 600). If this fails, try re-running after making any change to a file in the file package.
2023-06-21 20:03:33,091 ERROR runtime_env_agent.py:383 -- Runtime env creation failed for 3 times, don't retry any more.
2023-06-21 20:03:33,091 INFO runtime_env_agent.py:130 -- Unused runtime env {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"}.
2023-06-21 20:03:33,091 INFO runtime_env_agent.py:111 -- Unused uris [('gcs://_ray_pkg_07d964c2ad903c39.zip', 'working_dir')].
2023-06-21 20:03:33,091 INFO runtime_env_agent.py:437 -- Runtime env already failed. Env: {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"}, err: Traceback (most recent call last):
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 357, in _create_runtime_env_with_retry
    runtime_env_context = await asyncio.wait_for(
  File "/opt/conda/envs/artemis/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/dashboard/modules/runtime_env/runtime_env_agent.py", line 312, in _setup_runtime_env
    await create_for_plugin_if_needed(
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/_private/runtime_env/plugin.py", line 252, in create_for_plugin_if_needed
    size_bytes = await plugin.create(uri, runtime_env, context, logger=logger)
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/_private/runtime_env/working_dir.py", line 155, in create
    local_dir = await download_and_unpack_package(
  File "/opt/conda/envs/artemis/lib/python3.10/site-packages/ray/_private/runtime_env/packaging.py", line 655, in download_and_unpack_package
    raise IOError(
OSError: Failed to download runtime_env file package gcs://_ray_pkg_07d964c2ad903c39.zip from the GCS to the Ray worker node. The package may have prematurely been deleted from the GCS due to a long upload time or a problem with Ray. Try setting the environment variable RAY_RUNTIME_ENV_TEMPORARY_REFERENCE_EXPIRATION_S  to a value larger than the upload time in seconds (the default is 600). If this fails, try re-running after making any change to a file in the file package.

2023-06-21 20:03:33,091 WARNING runtime_env_agent.py:128 -- Runtime env {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"} does not exist.
2023-06-21 20:03:33,091 WARNING runtime_env_agent.py:109 -- URI gcs://_ray_pkg_07d964c2ad903c39.zip does not exist.
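To make the ordering of events per package easier to read, here is a small sketch that groups runtime_env_agent.log lines by package URI. The regular expressions are assumptions based only on the log format shown above, `timeline_by_uri` is a hypothetical helper, and the sample lines are copied from this log. It only reorganizes the log; it does not show why the package is missing.

```python
import re
from collections import defaultdict

# Matches lines such as:
# 2023-06-21 20:03:30,070 INFO runtime_env_agent.py:339 -- Creating runtime env: ...
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) runtime_env_agent\.py:\d+ -- (?P<msg>.*)$"
)
URI_RE = re.compile(r"gcs://_ray_pkg_[0-9a-f]+\.zip")


def timeline_by_uri(lines):
    """Return {uri: [(timestamp, level, shortened message), ...]} in log order."""
    events = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # traceback and exception lines carry no timestamp
        uri = URI_RE.search(match["msg"])
        if uri:
            summary = URI_RE.sub("<uri>", match["msg"])[:80]
            events[uri.group()].append((match["ts"], match["level"], summary))
    return events


SAMPLE_LINES = [
    '2023-06-21 20:03:30,070 INFO runtime_env_agent.py:339 -- Creating runtime env: {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"} with timeout 600 seconds.',
    '2023-06-21 20:03:30,071 INFO runtime_env_agent.py:497 -- Got request from raylet to decrease reference for runtime env: {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"}.',
    "Traceback (most recent call last):",
    '2023-06-21 20:03:30,082 ERROR runtime_env_agent.py:365 -- Failed to create runtime env {"working_dir": "gcs://_ray_pkg_07d964c2ad903c39.zip"}.',
]

events = timeline_by_uri(SAMPLE_LINES)
for uri, entries in events.items():
    print(uri)
    for ts, level, summary in entries:
        print(f"  {ts} {level:7} {summary}")
```

To run it on a real log, pass an open file instead of `SAMPLE_LINES`, for example `timeline_by_uri(open(path))` with `path` pointing at the agent log.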

Versions / Dependencies

Ray: 2.4.0
Python: 3.10.0
OS: Ubuntu 20.04 in Docker
KubeRay: 0.5.0

Reproduction script

Don't know how to reproduce this reliably.

Issue Severity

Low: It annoys or frustrates me.

nate-bush commented 1 year ago

I'm running into the same intermittent failure with Python 3.10 and Ray 2.4.0, using the rayproject/ray-ml:2.4.0-py310-gpu image with some additional packages and pip installations. It fails after ~3 seconds.

jjyao commented 1 year ago

@wjzhou-ep @nate-bush, could you tell me how I can reproduce this?

Are you using ray client? How do you use runtime env? Do you have job level or task/actor level runtime env?

wjzhou-ep commented 1 year ago

Sorry, we don't have a reliable reproduction script yet.

Are you using ray client? Yes, we are using Ray Client from an IPython notebook.

How do you use runtime env? We use a local folder as the working_dir, something like:

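import ray

# _get_working_dir() is a project-specific helper (not shown) that returns the local folder path.
client = ray.init(
    runtime_env={
        "working_dir": _get_working_dir(),
        "excludes": ["*.ipynb", "tests", "folder_tmp", "*_test.py", "*.html"],
    },
)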

Do you have job level or task/actor level runtime env? It's a job-level runtime env.

One detail: the content of our working_dir does not change for long periods, and multiple clients use it. Other than that, I don't see anything special about our setup.

FYI, and please take this with a grain of salt: switching to an external Redis (KubeRay FT) might have fixed the problem, but we are not sure. We might just have been lucky.

sveint commented 4 weeks ago

I'm seeing the same issue every time I start new jobs in a scaled-to-0 cluster with a somewhat large codebase. I normally have to restart the script once the node is up and the runtime env has failed to install for unknown reasons (possibly the missing file). I'm willing to spend some time debugging what goes wrong, but would love some pointers on how things are put together and how to debug it.