ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
33.97k stars 5.77k forks

[Core] Logs are duplicated if multiple nodes are running on same machine #48642

Open JakkuSakura opened 5 days ago

JakkuSakura commented 5 days ago

What happened + What you expected to happen

While experimenting with Ray, I ran into the problem described in https://github.com/ray-project/ray/issues/10392. That issue was closed because nobody could provide a reproducible example.

Versions / Dependencies

ray[all] 2.38.0, macOS

Reproduction script

# example.py
import ray

@ray.remote
def foo():
    print('hello')

if __name__ == '__main__':
    ray.init()
    handle = foo.remote()
    ray.get(handle)

RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1 ray start --head
RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=1 ray start --address='192.168.0.196:6379'
python example.py

Output:
2024-11-08 13:54:19,817 INFO worker.py:1601 -- Connecting to existing Ray cluster at address: 192.168.0.196:6379...
2024-11-08 13:54:19,831 INFO worker.py:1777 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
(foo pid=45881) hello
(foo pid=45881) hello
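To confirm the duplication programmatically instead of by eye, the driver output can be scanned for repeated worker-prefixed lines. This is only a minimal sketch: `find_duplicated_worker_lines` is a hypothetical helper, not part of Ray, and the regular expression assumes the `(name pid=N)` prefix seen in the output above (with an optional `, ip=...` part, which is an assumption about how lines from other hosts look).

```python
import re
from collections import Counter

# Assumed prefix format, based on the output above: "(foo pid=45881) hello".
# The optional ", ip=..." group is an assumption for lines from remote hosts.
WORKER_LINE = re.compile(
    r"^\((?P<name>\S+) pid=(?P<pid>\d+)(?:, ip=[^)]+)?\) (?P<msg>.*)$"
)


def find_duplicated_worker_lines(output: str) -> dict:
    """Map (task name, pid, message) to its count, for counts above one."""
    counts = Counter()
    for line in output.splitlines():
        match = WORKER_LINE.match(line.strip())
        if match:
            key = (match["name"], int(match["pid"]), match["msg"])
            counts[key] += 1
    # Keep only the lines that showed up more than once.
    return {key: n for key, n in counts.items() if n > 1}


sample = "(foo pid=45881) hello\n(foo pid=45881) hello\n"
print(find_duplicated_worker_lines(sample))
```

A task that genuinely prints the same line twice would also be flagged, so this is only meaningful for scripts like example.py where each message is printed once.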

Issue Severity

Low: It annoys or frustrates me.

A workaround is at: https://github.com/intel-analytics/BigDL-2.x/pull/2799/files

I mitigated this issue by calling the function below after starting the worker node. It has many downsides, of course, and is not a long-term solution.


def kill_redundant_log_monitors():
    """
    Kill redundant log_monitor.py processes.

    If multiple Ray nodes are started on the same machine,
    there will be multiple log_monitor.py processes
    monitoring the same log dir. As a result, the logs
    are replicated multiple times and forwarded to the driver.
    See issue https://github.com/ray-project/ray/issues/10392
    """
    import psutil
    import subprocess

    # Collect every process whose command line mentions log_monitor.py.
    log_monitor_processes = []
    for proc in psutil.process_iter(["name", "cmdline"]):
        try:
            cmdline = subprocess.list2cmdline(proc.cmdline())
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue
        if "log_monitor.py" in cmdline:
            log_monitor_processes.append(proc)

    # Keep the first log monitor found and kill the rest.
    for proc in log_monitor_processes[1:]:
        try:
            proc.kill()
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue  # already exited, or not ours to kill
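Before killing anything, it may help to check whether more than one log monitor is actually running. The following is a rough, read-only sketch of that check. `select_log_monitors` and `running_log_monitor_pids` are hypothetical names, not Ray APIs, and the match on an argument ending in `log_monitor.py` is a slightly stricter variant of the substring test used in the workaround above.

```python
from typing import Iterable, List, Sequence, Tuple


def select_log_monitors(
    processes: Iterable[Tuple[int, Sequence[str]]]
) -> List[int]:
    """Return the PIDs whose command line has an argument ending in log_monitor.py."""
    pids = []
    for pid, cmdline in processes:
        if any(arg.endswith("log_monitor.py") for arg in cmdline):
            pids.append(pid)
    return pids


def running_log_monitor_pids() -> List[int]:
    """Scan the local machine; needs the third-party psutil package."""
    import psutil

    snapshot = []
    for proc in psutil.process_iter(["pid", "cmdline"]):
        # psutil stores None when a field could not be read (e.g. access denied).
        snapshot.append((proc.info["pid"], proc.info["cmdline"] or []))
    return select_log_monitors(snapshot)
```

If `running_log_monitor_pids()` returns more than one PID on a single machine, that matches the cause described in the docstring above: several log monitors watching the same log directory.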

kevin85421 commented 3 days ago

Thank you for reporting the issue!