ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
32.95k stars 5.58k forks source link

[Jobs] Setting `RAY_LOG_TO_STDERR` results in empty job logs #24886

Open spolcyn opened 2 years ago

spolcyn commented 2 years ago

What happened + What you expected to happen

  1. When RAY_LOG_TO_STDERR is set, a job doesn't receive the logs from actors it creates.
  2. We'd expect the logs to look the same whether RAY_LOG_TO_STDERR is set or unset.

Note: "empty" here means the job driver log lacks all the previous output from the actors created, and it only has the internal python-core-driver cpp logs.

Potentially related to https://github.com/ray-project/ray/issues/23564

Versions / Dependencies

Ray 1.12.0 Python 3.7.12 Ubuntu 20.04

Reproduction script

Instructions:

  1. Put test_job.py and run.py in the same folder
  2. Run python test_job.py
  3. Wait for completion, open repro-logs.txt, find search for me in logs string present
  4. Run RAY_LOG_TO_STDERR=1 python test_job.py
  5. Wait for completion, open repro-logs.txt do not find search for me in logs string present
# test_job.py
import time

import ray
from ray.job_submission import JobStatus, JobSubmissionClient

ray.init()
client = JobSubmissionClient("localhost:8265")
job_id: str = client.submit_job(entrypoint="python run.py")
while True:
    status: JobStatus = client.get_job_status(job_id)
    print("Status: ", status)

    if status in {JobStatus.STOPPED, JobStatus.FAILED, JobStatus.SUCCEEDED}:
        print("Done")
        logs = client.get_job_logs(job_id)
        with open("repro-logs.txt", mode="w") as f:
            f.write(logs)
        break

    time.sleep(1)
# run.py
import ray
import time

ray.init(address="auto")

@ray.remote
class A:
    def f(self):
        for i in range(3):
            print("search for me in logs:", i)
            time.sleep(1)

handle = A.remote()
ray.get(handle.f.remote())

Issue Severity

Medium: It is a significant difficulty but I can work around it.

DmitriGekhtman commented 2 years ago

I think @clarkzinzow @rkooo567 are the RAY_LOG_TO_STDERR experts. I think @edoakes is on-call for jobs stuff.

rkooo567 commented 2 years ago

Hmm this is pretty tricky. It is because how we display logs to the driver is by reading a log file. If RAY_LOG_TO_STDERR is set, we don't write anything to the log file, so it won't be able to get logs from the driver. I think there are 2 ways to fix it, but both are not very straightforward nor clean.

  1. If RAY_LOG_TO_STDERR is set, the component directly publishes the batched log message to the driver. This can cause some performance regression when RAY_LOG_TO_STDERR is set because it will cause the higher load to the driver + workers.
  2. We redirect stdout to both stdout AND file. This seems to be a better solution, but it seems pretty difficult to have multiple sinks when redirecting stdout using Popen which is the way we are doing..
stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

DmitriGekhtman commented 1 year ago

Keep open.