skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[UX] Unnecessary logs from ray #4300

Open Michaelvll opened 1 week ago

Michaelvll commented 1 week ago

I am seeing unnecessary logs from Ray when trying to run 1000 jobs on a controller (with 64 cores) in #4281.

sky jobs logs --controller 1

(job-64, pid=3812) I 11-08 08:20:23 utils.py:103] ==================================
(job-64, pid=3812) I 11-08 08:20:44 utils.py:94] === Checking the job status... ===
(job-64, pid=3812) I 11-08 08:20:44 utils.py:100] Job status: JobStatus.RUNNING
(job-64, pid=3812) I 11-08 08:20:44 utils.py:103] ==================================
(raylet) WARNING: 960 PYTHON worker processes have been started on node: cb2dfce547e21659e3d726c1284ba18926ca9ddca8454111120aa038 with address: 172.31.14.176. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
(job-64, pid=3812) I 11-08 08:21:05 utils.py:94] === Checking the job status... ===
(job-64, pid=3812) I 11-08 08:21:05 utils.py:100] Job status: JobStatus.RUNNING
(job-64, pid=3812) I 11-08 08:21:05 utils.py:103] ==================================
(job-64, pid=3812) I 11-08 08:21:25 utils.py:94] === Checking the job status... ===
(job-64, pid=3812) I 11-08 08:21:26 utils.py:100] Job status: JobStatus.RUNNING
(job-64, pid=3812) I 11-08 08:21:26 utils.py:103] ==================================

This is likely due to #4247.
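
Until the root cause is addressed, the noise could be hidden on the reading side by filtering the streamed lines. The following is only a minimal sketch, assuming the unwanted lines always start with the `(raylet)` prefix seen in the output above. `drop_raylet_lines` is a hypothetical helper, not part of SkyPilot or Ray, and the sample lines are abbreviated copies of the log for illustration.

```python
import re
from typing import Iterable, Iterator

# Assumption: noisy lines begin with the literal "(raylet) " prefix,
# as in the log excerpt in this issue.
_RAYLET_PREFIX = re.compile(r'^\(raylet\)\s')


def drop_raylet_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that do not start with the '(raylet) ' prefix."""
    for line in lines:
        if not _RAYLET_PREFIX.match(line):
            yield line


# Abbreviated sample lines modeled on the output above.
sample = [
    '(job-64, pid=3812) I 11-08 08:20:44 utils.py:100] Job status: JobStatus.RUNNING',
    '(raylet) WARNING: 960 PYTHON worker processes have been started on node',
    '(job-64, pid=3812) I 11-08 08:21:05 utils.py:94] === Checking the job status... ===',
]
filtered = list(drop_raylet_lines(sample))
for line in filtered:
    print(line)
```

This only hides the warning from the displayed logs; the 960 worker processes reported by the raylet would still be started.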

Version & Commit info:

cg505 commented 1 week ago

Let's wait for #4318 and see whether it is better afterward.