ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray Tune] [JobSubmissionClient] Error fetching job logs using client.get_job_logs(job_id) with JobSubmissionClient #45518

Open Manish-2004 opened 3 months ago

Manish-2004 commented 3 months ago

What happened + What you expected to happen

I am trying to run the example at https://docs.ray.io/en/latest/tune/examples/tune-xgboost.html#id8, using JobSubmissionClient to submit the script. The example itself runs fine, but I get the error below when fetching the job logs with client.get_job_logs(job_id).

--------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[3], line 25
     23     else:
     24         break
---> 25 client.get_job_logs(job_id) 

File /opt/conda/envs/ray/lib/python3.11/site-packages/ray/dashboard/modules/job/sdk.py:453, in JobSubmissionClient.get_job_logs(self, job_id)
    451     return JobLogsResponse(**r.json()).logs
    452 else:
--> 453     self._raise_error(r)

File /opt/conda/envs/ray/lib/python3.11/site-packages/ray/dashboard/modules/dashboard_sdk.py:283, in SubmissionClient._raise_error(self, r)
    282 def _raise_error(self, r: "requests.Response"):
--> 283     raise RuntimeError(
    284         f"Request failed with status code {r.status_code}: {r.text}."
    285     )

RuntimeError: Request failed with status code 500: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/dashboard/modules/job/job_head.py", line 462, in get_job_logs
    resp = await job_agent_client.get_job_logs_internal(job.submission_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/dashboard/modules/job/job_head.py", line 107, in get_job_logs_internal
    async with self._session.get(
  File "/home/ray/anaconda3/lib/python3.11/site-packages/aiohttp/client.py", line 1167, in __aenter__
    self._resp = await self._coro
                 ^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/aiohttp/client.py", line 586, in _request
    await resp.start(conn)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/aiohttp/client_reqrep.py", line 905, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/aiohttp/streams.py", line 616, in read
    await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
.

Versions / Dependencies

KubeRay Operator v1.1.1
Ray v2.21.0

Reproduction script

import ray
from ray.job_submission import JobSubmissionClient
import time

# Ray cluster information for connection
ray_head_ip = "kuberay-head-svc.kuberay.svc.cluster.local"
ray_head_port = 8265
ray_address = f"http://{ray_head_ip}:{ray_head_port}"
client = JobSubmissionClient(ray_address)

# Submit Ray job using JobSubmissionClient
job_id = client.submit_job(
    entrypoint="python xgb.py",
    runtime_env={
        "working_dir": "./",
    },
    entrypoint_num_cpus=3
)

print(f"Ray job submitted with job_id: {job_id}")

# Waiting for Ray to finish the job and print the result
while True:
    status = client.get_job_status(job_id)
    if status in [ray.job_submission.JobStatus.RUNNING, ray.job_submission.JobStatus.PENDING]:
        time.sleep(5)
    else:
        break
client.get_job_logs(job_id)

Issue Severity

High: It blocks me from completing my task.

sercanCyberVision commented 3 months ago

Hello @anyscalesam, is there any update on this issue?

sercanCyberVision commented 2 months ago

We submit all our actions to the Ray cluster with JobSubmissionClient as described. We face this issue only with Tune. Also, please note that the jobs themselves still execute correctly; the issue occurs only when we try to fetch the logs.
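
Since the job itself completes and only the log fetch fails, one possible stopgap is to retry the call. The sketch below is not a confirmed fix from this thread: the helper name `fetch_logs_with_retry` and its parameters are hypothetical, and it only assumes what the traceback shows, namely that a failed request surfaces as a `RuntimeError`. It will not help if the server disconnects on every attempt.

```python
import time
from typing import Callable, Optional


def fetch_logs_with_retry(
    fetch_logs: Callable[[str], str],
    job_id: str,
    attempts: int = 3,
    delay_s: float = 2.0,
) -> str:
    """Call fetch_logs(job_id), retrying when it raises RuntimeError.

    fetch_logs is any callable that returns the logs as a string, for
    example client.get_job_logs. This helper is hypothetical and is not
    part of the Ray API.
    """
    last_error: Optional[RuntimeError] = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch_logs(job_id)
        except RuntimeError as err:
            # The traceback in this issue shows failed requests being
            # raised as RuntimeError by the client's _raise_error.
            last_error = err
            if attempt < attempts:
                # Wait a little longer after each failed attempt.
                time.sleep(delay_s * attempt)
    raise RuntimeError(
        f"Could not fetch logs for {job_id} after {attempts} attempts"
    ) from last_error
```

With the reproduction script above, the final line would become `logs = fetch_logs_with_retry(client.get_job_logs, job_id)`.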