ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Core] Expose logs for runtime environment installation process on worker nodes for remote Ray clusters #34310

Open brosand opened 1 year ago

brosand commented 1 year ago

Description

When I submit a job to a remote Ray cluster, the job fails silently with empty logs. After analyzing the code, I believe the failure occurs during installation of the runtime environment, which may not take place on the worker node. The logs for this installation step are not exposed anywhere that I can find, and at the very least they are not linked to the job itself.

Use case

As a Ray user, when I submit a job to a remote Ray cluster I need to be able to see the logs of the runtime environment installation process on the worker nodes. This would help me identify issues that prevent my job from running and let me troubleshoot and resolve them more efficiently.

Additionally, it would be helpful if the runtime environment installation logs were linked to the job itself, so that I can easily tell which logs correspond to which job. That would make it much easier to debug failures that occur during runtime environment installation.

cadedaniel commented 1 year ago

Thanks for opening this issue @brosand. I think it's a good point. Users should be able to see dependency setup logs.

To unblock you: there are files in /tmp/ray/session_latest/logs that contain the runtime env setup output, e.g. runtime_env_setup-01000000.log.
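For example, something along these lines on the node where the environment was installed will surface that output (the numeric suffix in the file name varies per job, so the exact path below is only illustrative):

# List runtime env setup logs for the current session
ls /tmp/ray/session_latest/logs/runtime_env_setup-*.log
# Show the tail of one of them (file name illustrative)
tail -n 50 /tmp/ray/session_latest/logs/runtime_env_setup-01000000.log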

architkulkarni commented 1 year ago

Hi @brosand, sorry you're running into this. Do you have a reproduction for the part about "failing silently"? That could be a bug. Currently, the expected behavior is that the job fails with a status message that includes the traceback, e.g.

ray job submit --runtime-env-json='{"pip": ["does-not-exist"]}' -- echo hi

Job submission server address: http://127.0.0.1:8265

-------------------------------------------------------
Job 'raysubmit_CHKt5sk6Sr2VkDgf' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_CHKt5sk6Sr2VkDgf
  Query the status of the job:
    ray job status raysubmit_CHKt5sk6Sr2VkDgf
  Request the job to be stopped:
    ray job stop raysubmit_CHKt5sk6Sr2VkDgf

Tailing logs until the job exits (disable with --no-wait):

---------------------------------------
Job 'raysubmit_CHKt5sk6Sr2VkDgf' failed
---------------------------------------

Status message: runtime_env setup failed: Failed to set up runtime environment.
Could not create the actor because its associated runtime env failed to be created.
Traceback (most recent call last):
[...]
ray._private.runtime_env.utils.SubprocessCalledProcessError: Run cmd[9] failed with the following details.
Command '['/tmp/ray/session_2023-04-17_13-09-44_690267_18442/runtime_resources/pip/5dccbbcc6d1279caf878bf88e7c96c45527d553a/virtualenv/bin/python', '-m', 'pip', 'install', '--disable-pip-version-check', '--no-cache-dir', '-r', '/tmp/ray/session_2023-04-17_13-09-44_690267_18442/runtime_resources/pip/5dccbbcc6d1279caf878bf88e7c96c45527d553a/requirements.txt']' returned non-zero exit status 1.
Last 50 lines of stdout:
    ERROR: Could not find a version that satisfies the requirement does-not-exist (from versions: none)
    ERROR: No matching distribution found for does-not-exist

Agree that ideally there should be a way to get the runtime env logs using the Jobs API, similar to the existing job log API (which only gets the output of the entrypoint script).
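For context, the existing command only streams the entrypoint's output; the setup logs still have to be read from the node-local files mentioned earlier. A rough sketch, reusing the job ID and paths from this thread:

# Returns the entrypoint's stdout/stderr, not the runtime env installation output
ray job logs raysubmit_CHKt5sk6Sr2VkDgf
# The installation output currently lives on the node that performed the setup
cat /tmp/ray/session_latest/logs/runtime_env_setup-*.log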

rkooo567 commented 1 year ago

This is not a core issue (added the dashboard label). I think we can support this from the dashboard & state API (ray list runtime-envs) instead of requiring users to look directly at log files.
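For anyone following along, the state API command referenced above looks roughly like this (a sketch; the exact flags and fields depend on the Ray version):

# List runtime environments known to the cluster and whether their creation succeeded
ray list runtime-envs
# Include per-entry details such as the node and any creation error (flag assumed from the state API CLI)
ray list runtime-envs --detail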