ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

[jobs] Submit, Stop, and Delete Job Hanging on a Ray Cluster After a Period of Time #38605

Open VishDev12 opened 1 year ago

VishDev12 commented 1 year ago

What happened + What you expected to happen

All job operations work as expected when the cluster is launched, and for a non-deterministic period of time after that.

Listing jobs through the CLI, SDK, or the Dashboard API keeps working indefinitely. But after that non-deterministic period, submitting, stopping, or deleting a job times out after roughly 5 minutes, with the error below for submit and a generic exception for stop and delete:

No available agent to submit job, please try again later.

The source of that error is job_head.py, in these three methods: submit_job, stop_job, and delete_job. Each of them shares this snippet in its try block:

job_agent_client = await asyncio.wait_for(
    self.choose_agent(),
    timeout=dashboard_consts.WAIT_AVAILABLE_AGENT_TIMEOUT,
)

followed by, correspondingly:

  1. resp = await job_agent_client.submit_job_internal(submit_request)
  2. resp = await job_agent_client.stop_job_internal(job.submission_id)
  3. resp = await job_agent_client.delete_job_internal(job.submission_id)

To narrow things down, I called the above internal functions directly, which appear to hit this API route: /api/job_agent/jobs/. That worked perfectly fine: I was able to submit a new job, stop it, and delete a job.
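For illustration, here's a rough sketch of hitting that route directly over HTTP. The port 52365 is the default dashboard agent port and the minimal {"entrypoint": ...} payload mirrors the public jobs API; both are assumptions on my part, not the agent's verified schema.

import requests

# Rough sketch (not verified): POST directly to the job agent route.
# Assumes the agent's HTTP server listens on the default port 52365 and that
# the payload mirrors the public /api/jobs/ API; adjust both as needed.
AGENT_ADDR = "http://localhost:52365"  # default dashboard agent port (assumption)

resp = requests.post(
    f"{AGENT_ADDR}/api/job_agent/jobs/",
    json={"entrypoint": "echo hello"},  # hypothetical minimal payload
)
print(resp.status_code, resp.text)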

This narrowed the issue down to choose_agent causing the timeout, and I think this loop is the most likely culprit:

https://github.com/ray-project/ray/blob/586c376e0769082cb5cfa1333e8264a5fa6b73ec/dashboard/modules/job/job_head.py#L185-L194

Here, DataOrganizer's get_all_agent_infos iterates over the DataSource singleton dictionary and returns the information. So the only way for the flow to get stuck in that loop is if get_all_agent_infos() returns an empty dictionary, leaving agent_infos in the while loop empty (falsy), so the loop keeps running until the request times out after 5 minutes.
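To make the failure mode concrete, here's a simplified, self-contained paraphrase of that polling behavior (not a verbatim copy of job_head.py): while the agent-info dict stays empty, the coroutine just keeps sleeping and retrying, so the outer asyncio.wait_for is the only thing that ever ends it.

import asyncio

# Simplified paraphrase of the choose_agent polling behavior (not the real code):
# keep polling until at least one agent's info shows up.
async def choose_agent_sketch(get_all_agent_infos, poll_interval_s=1.0):
    while True:
        agent_infos = await get_all_agent_infos()
        if agent_infos:
            node_id = next(iter(agent_infos))
            return node_id, agent_infos[node_id]
        await asyncio.sleep(poll_interval_s)

async def main():
    async def empty_infos():
        # Stand-in for DataOrganizer.get_all_agent_infos() when DataSource.agents is empty.
        return {}
    try:
        # Shortened to 2 seconds here; the dashboard waits ~5 minutes.
        await asyncio.wait_for(choose_agent_sketch(empty_infos), timeout=2)
    except asyncio.TimeoutError:
        print("timed out: no agent info ever became available")

asyncio.run(main())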

For the returned dictionary to be empty, DataSource.agents has to be empty. It can't have been empty all along, since everything worked fine when the cluster launched, so some process in between must be emptying it out.

Searching across the codebase, I noticed only one place where DataSource.agents could be manipulated:

https://github.com/ray-project/ray/blob/586c376e0769082cb5cfa1333e8264a5fa6b73ec/dashboard/modules/node/node_head.py#L181-L201

To check this, I connected to the GCS with:

from ray._private.gcs_utils import GcsAioClient

# Run from an async context (e.g. an IPython session, which supports top-level await).
gcs_client = GcsAioClient(address="localhost:6379")

# List every dashboard-agent port entry in the GCS internal KV (namespace "dashboard")...
r = await gcs_client.internal_kv_keys(b"DASHBOARD_AGENT_PORT_PREFIX", b'dashboard')

# ...and fetch the corresponding values.
s = await gcs_client.internal_kv_multi_get(r, b'dashboard')

print(s)
>>> 
{b'DASHBOARD_AGENT_PORT_PREFIX:860f23034e922b6872120ca06ab2864d853d57a75402a2d1a2d09ea2': b'[52365, 64235]',
 b'DASHBOARD_AGENT_PORT_PREFIX:42bfd765db71ac3ff2ef61b34b9d9ecfac721cdcdb19c8a97c2e0154': b'[52365, 48568]',
 b'DASHBOARD_AGENT_PORT_PREFIX:e2e7b198a5fa6669a5955a68d0f9cef1c17bc7b86530f5791c066575': b'[52365, 48758]',
 b'DASHBOARD_AGENT_PORT_PREFIX:2c65da02273c9a99f2632cf70b919d325d457fb6a53b132fe6f27b73': b'[52365, 53181]',
 b'DASHBOARD_AGENT_PORT_PREFIX:8933e59a428eeb0ed5fe99f6e569ec837797cdcf65b99fc148be266e': b'[52365, 63340]'}
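For readability, here's a small sketch that parses those entries, assuming the key suffix is the node ID and the value is a JSON list of two ports (the first one, 52365, matching the default dashboard agent port); both are inferences from the dump above rather than verified facts about the source.

import json

# Parse the internal-KV dump above into {node_id: [port1, port2]}.
# The key layout and port meanings are inferred from the output, not verified.
def parse_agent_ports(kv):
    parsed = {}
    for key, value in kv.items():
        _, node_id = key.decode().split(":", 1)
        parsed[node_id] = json.loads(value)
    return parsed

# e.g. parse_agent_ports(s) -> {'860f2303...': [52365, 64235], ...}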

So the issue clearly isn't with the data being available in the GCS. I haven't been able to debug this further, and it's entirely possible that I've missed something during debugging and am barking up the wrong tree.

I'd really appreciate any help with this!

Versions / Dependencies

Ray - 2.5.0 (though the relevant code pieces seem to be the same in 2.6+?)
Python - 3.10.12
Ubuntu - 22.04

Reproduction script

Not sure how to provide a reproducible script for this issue.

Issue Severity

High: It blocks me from completing my task.

VishDev12 commented 1 year ago

Hey @architkulkarni! Hope you don't mind the tag; I noticed your name in the commits and wanted to bring this to your attention 🙂

architkulkarni commented 1 year ago

Hi @VishDev12 thanks for reporting this. Are you using a multi-node cluster? Can you see if the agent is alive on all of the nodes? (You can check dashboard_agent.log on each node)

cc @akshay-anyscale for triage

VishDev12 commented 1 year ago

Yes, one of the clusters is a multi-node cluster running on EC2. But I also faced the same issue on a single-node cluster (1 EC2 head node).

On the multi-node cluster, I see a dashboard_agent.log on all 3 nodes.

architkulkarni commented 1 year ago

I see, it's probably best to narrow it down to the single node case. There's nothing in the logs to indicate that the agent died?

VishDev12 commented 1 year ago

No, nothing at all. Every API other than submit, stop, and delete job works. And as noted in my original comment, the internal job_agent APIs work for submit, stop, and delete. Is it possible that DataSource.agents being empty is the issue? Is that likely even though the GCS has the necessary values, as listed above?