ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray Core] GCS expires due to unmodified files in the working-dir #46625

Open yx367563 opened 2 months ago

yx367563 commented 2 months ago

Description

When I run my program, the following error occasionally appears: OSError: Failed to download runtime_env file package gcs://xxx.zip from the GCS to the Ray worker node. The package may have prematurely been deleted from the GCS due to a long upload time or a problem with Ray. Try setting the environment variable RAY_RUNTIME_ENV_TEMPORARY_REFERENCE_EXPIRATION_S to a value larger than the upload time in seconds (the default is 600). If this fails, try re-running after making any change to a file in the file package.
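The workaround named in the error message can be sketched as follows. This is a minimal sketch, assuming the variable has to be present in the driver process's environment before ray.init uploads the package. The helper name set_reference_expiration and the 1800-second value are illustrative and not part of the original report.

```python
import os

# Environment variable named in the error message (default there: 600 seconds).
ENV_VAR = "RAY_RUNTIME_ENV_TEMPORARY_REFERENCE_EXPIRATION_S"


def set_reference_expiration(seconds: int) -> str:
    """Hypothetical helper: export the expiration, in seconds, for this process."""
    if seconds <= 0:
        raise ValueError("expiration must be a positive number of seconds")
    os.environ[ENV_VAR] = str(seconds)
    return os.environ[ENV_VAR]


# Assumption: this runs before ray.init() so the upload sees the new value.
set_reference_expiration(1800)
```

This only lengthens the window before the package reference expires. It does not address the case described below, where an expired package is not uploaded again.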

Use case

If you resubmit the same task without modifying any file in the working-dir, Ray does not re-upload the compressed package, even if the package in the GCS has expired. Could Ray check whether the package has expired before deciding whether it needs to be re-uploaded?

Also, where is the gcs://.. location used for working_dir actually stored? Is it an official remote file storage system provided by Ray, or is it on the head node of the Ray cluster? I have seen some people recommend against using Ray's gcs://.. because it seems to be unstable.

rynewang commented 2 months ago

@yx367563 Can you provide a repro script and procedure for this? Thanks.

yx367563 commented 2 months ago

@rynewang There is no special reproduction code. You only need to set py_modules to a local .whl file in ray.init and then call it repeatedly to trigger the error.
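A minimal sketch of that procedure, assuming "call it repeatedly" means repeated ray.init calls against the same cluster with an unmodified wheel. The function names, the trivial ping task, and the attempt count are illustrative. The report does not say how long to wait between calls, so this is not confirmed to trigger the error.

```python
def build_runtime_env(wheel_path: str) -> dict:
    """Build a runtime_env that ships a local wheel through py_modules."""
    return {"py_modules": [wheel_path]}


def connect_repeatedly(wheel_path: str, attempts: int = 5) -> None:
    """Connect and disconnect several times with the same, unmodified wheel."""
    import ray  # imported lazily so build_runtime_env works without Ray installed

    for attempt in range(attempts):
        ray.init(runtime_env=build_runtime_env(wheel_path))
        try:
            @ray.remote
            def ping() -> str:
                return "ok"

            # Running a task forces the worker to set up the runtime env.
            print(attempt, ray.get(ping.remote()))
        finally:
            ray.shutdown()
```

To try it, call connect_repeatedly with the path to a locally built wheel. Any path you pass is your own; none is given in the report.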

FantasticEthan commented 3 weeks ago

@rynewang I encountered the same issue and suspect that it is caused by a failure to locate the corresponding code at this step: https://github.com/ray-project/ray/blob/c50e3b66d30110d92274a3477a27992e0a3e09fb/python/ray/_private/runtime_env/packaging.py#L659. Additionally, during the runtime environment setup phase for the task, I noticed this log message: ‘Runtime env working_dir gcs://xx.zip is already installed and will be reused.’ To find the corresponding setup log, search all runtime_env_setup-*.log files.
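The log search can be sketched with the standard library alone. This assumes the logs live in Ray's default session directory, /tmp/ray/session_latest/logs, on the node being inspected. The function name find_setup_log_lines is illustrative.

```python
from pathlib import Path

# Assumption: Ray is using its default temp directory on this node.
DEFAULT_LOG_DIR = "/tmp/ray/session_latest/logs"


def find_setup_log_lines(needle: str, log_dir: str = DEFAULT_LOG_DIR) -> list:
    """Return (file name, line number, line) for each log line containing needle."""
    matches = []
    for path in sorted(Path(log_dir).glob("runtime_env_setup-*.log")):
        with path.open(errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if needle in line:
                    matches.append((path.name, lineno, line.rstrip("\n")))
    return matches
```

For example, find_setup_log_lines("is already installed and will be reused") lists every setup log line that reports a reused package.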