VishDev12 opened this issue 1 year ago (status: Open)
Hey @architkulkarni! Hope you don't mind the tag, I noticed your name in the commits and wanted to bring this to your attention 🙂
Hi @VishDev12, thanks for reporting this. Are you using a multi-node cluster? Can you check whether the agent is alive on all of the nodes? (You can check dashboard_agent.log on each node.)
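For anyone following along, a quick way to do that check on a node might look like this (a sketch: the log path assumes Ray's default temp dir, and the pgrep pattern assumes the agent process is launched via dashboard/agent.py; adjust for custom --temp-dir setups):

```shell
# Sketch: check whether the dashboard agent looks alive on this node.
LOG=/tmp/ray/session_latest/logs/dashboard_agent.log
if [ -f "$LOG" ]; then
  # Recent agent log lines; errors or a silent tail can indicate a dead agent.
  tail -n 20 "$LOG"
  # Is the agent process itself still running?
  pgrep -f dashboard/agent.py >/dev/null \
    && echo "agent process running" \
    || echo "agent process NOT running"
else
  echo "no dashboard_agent.log found at $LOG"
fi
```

Run this on each node (e.g. via ssh) and compare; an agent that died should show up as a missing process or a log that stops abruptly.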
cc @akshay-anyscale for triage
Yes, one of the clusters is a multi-node cluster running on EC2. But I also hit the same issue on a single-node cluster (one EC2 head node).
On the multi-node cluster, I see a dashboard_agent.log on all 3 nodes.
I see, it's probably best to narrow it down to the single node case. There's nothing in the logs to indicate that the agent died?
No, nothing at all. Every API other than submit, stop, and delete works. And as noted in my original comment, the internal job_agent APIs work for submit, stop, and delete. Is it possible that DataSource.agents being empty is the issue? Is that likely even though the GCS has the necessary values, as listed above?
What happened + What you expected to happen
All job operations work as expected when the cluster is launched and for a period of time (not deterministic) after that.
Listing jobs through the CLI, SDK, or the Dashboard API continues to work fine indefinitely. But after that non-deterministic period, submitting, stopping, or deleting a job times out after roughly 5 minutes, with this error for submit and a generic exception for stop and delete:
No available agent to submit job, please try again later.
The source of that error is job_head.py, in these three methods: submit_job, stop_job, and delete_job. Each of them shares this snippet in its try block:
followed by, correspondingly:
resp = await job_agent_client.submit_job_internal(submit_request)
resp = await job_agent_client.stop_job_internal(job.submission_id)
resp = await job_agent_client.delete_job_internal(job.submission_id)
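The shared part is essentially a wait-for-an-agent step guarded by a timeout. Here is a minimal sketch of that pattern (a paraphrase, not the actual Ray source: choose_agent_stub, submit_job_like, and the timeout value are invented for illustration):

```python
import asyncio

# Stand-in for choose_agent() in the stuck state: it never finds an agent,
# so it never returns on its own.
async def choose_agent_stub():
    await asyncio.sleep(10)

async def submit_job_like(timeout_s: float = 0.2) -> str:
    # The handler waits for an agent with a timeout; when choose_agent()
    # never completes, wait_for raises TimeoutError and the handler
    # responds with the error seen in this issue.
    try:
        await asyncio.wait_for(choose_agent_stub(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return "No available agent to submit job, please try again later."
    return "submitted"

print(asyncio.run(submit_job_like()))
```

In the real code the timeout is on the order of 5 minutes, which matches the observed delay before the error appears.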
To narrow things down, I tried the internal functions above, which seem to be accessing this API route: /api/job_agent/jobs/. And this worked perfectly fine: I was able to submit a new job, stop it, and delete a job. That narrowed the issue down to choose_agent causing the timeout, where this loop is the most likely culprit, I think:
https://github.com/ray-project/ray/blob/586c376e0769082cb5cfa1333e8264a5fa6b73ec/dashboard/modules/job/job_head.py#L185-L194
Here, DataOrganizer's get_all_agent_infos iterates over the DataSource singleton dictionary and returns the information. So the only way for the flow to get stuck in that loop is if get_all_agent_infos() returns an empty dictionary, meaning agent_infos in the while loop stays empty, so the loop continues until the request times out after 5 minutes.

For the returned dictionary to be empty, DataSource.agents has to be empty. But it can't have been empty all along, since everything worked fine when the cluster launched, so some process in between must be emptying it out. Searching across the codebase, I noticed only one place where DataSource.agents could be manipulated:
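The loop-stall hypothesis can be reproduced in a minimal, self-contained simulation (all names below are stand-ins for the Ray internals, not the actual implementation; the real poll interval and timeout differ):

```python
import asyncio

agents = {}  # stands in for the DataSource.agents singleton dict

async def get_all_agent_infos():
    # Stand-in for DataOrganizer.get_all_agent_infos: it only reads the
    # DataSource dict, so an empty dict in means an empty dict out.
    return dict(agents)

async def choose_agent(poll_interval_s: float = 0.01):
    # Mirrors the suspected loop: poll until at least one agent is known.
    while True:
        agent_infos = await get_all_agent_infos()
        if agent_infos:
            return next(iter(agent_infos.items()))
        await asyncio.sleep(poll_interval_s)

async def demo():
    # With an empty dict, the loop never exits and wait_for times out.
    try:
        await asyncio.wait_for(choose_agent(), timeout=0.1)
        empty_result = "chose agent"
    except asyncio.TimeoutError:
        empty_result = "timed out"
    # As soon as an agent entry appears, the loop exits immediately.
    agents["node-1"] = {"http_address": "10.0.0.1:52365"}
    node_id, _ = await asyncio.wait_for(choose_agent(), timeout=0.1)
    return empty_result, node_id

print(asyncio.run(demo()))
```

This shows the behavior is fully determined by whether the agents dict is empty, consistent with the theory that something clears DataSource.agents after launch.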
https://github.com/ray-project/ray/blob/586c376e0769082cb5cfa1333e8264a5fa6b73ec/dashboard/modules/node/node_head.py#L181-L201
To check this, I connected to the GCS with:
The agent values were present, so the issue clearly isn't the data being available in the GCS. I haven't been able to debug this further, and it's entirely possible that I've missed things during debugging and am barking up the wrong tree.
I'd really appreciate any help with this!
Versions / Dependencies
Ray - 2.5.0 (though the relevant code pieces seem to be the same in 2.6+?)
Python - 3.10.12
Ubuntu - 22.04
Reproduction script
I'm not sure how to provide a reproduction script for this issue, since the failure only appears after a non-deterministic amount of time.
Issue Severity
High: It blocks me from completing my task.