ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core] subprocess leaks if Ray worker crashes #42861

Closed rynewang closed 8 months ago

rynewang commented 8 months ago

What happened + What you expected to happen

If a Ray worker process spawns a subprocess and the worker then exits gracefully, it kills that subprocess as part of its exit procedure. However, if the worker is SIGKILL'd, it has no chance to clean up, and the subprocess leaks.
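
A minimal sketch of why the SIGKILL case differs, assuming a POSIX system: a process can install a handler for SIGTERM and clean up its children there, but the kernel does not allow SIGKILL to be caught, so no cleanup code can run. The helper name `can_install_handler` is hypothetical and is not part of Ray.

```python
import signal


def can_install_handler(signum):
    """Return True if a Python-level handler can be installed for signum.

    Hypothetical helper for illustration only. It must be called from the
    main thread, because signal.signal() is only allowed there.
    """
    try:
        previous = signal.signal(signum, lambda s, f: None)
    except (OSError, ValueError):
        # SIGKILL (and SIGSTOP) cannot be caught, blocked, or ignored.
        return False
    if previous is not None:
        # Put the original handler back so this probe has no side effects.
        signal.signal(signum, previous)
    return True
```

Called from the main thread, `can_install_handler(signal.SIGTERM)` returns `True` and `can_install_handler(signal.SIGKILL)` returns `False`, which is why cleanup has to be done by some other process when the worker is SIGKILL'd.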

Versions / Dependencies

master

Reproduction script

import multiprocessing
import os
import time
from pprint import pprint

import psutil
import ray

ray.init()

def sleep_forever():
    while True:
        time.sleep(10000)

def get_process_info(pid):
    try:
        process = psutil.Process(pid)
        return {
            "PID": process.pid,
            "Name": process.name(),
            "Status": process.status(),
            "CPU Times": process.cpu_times(),
            "Memory Info": process.memory_info(),
        }
    except psutil.NoSuchProcess:
        return f"No process found with PID: {pid}"
    except Exception as e:
        return f"Error: {e}"

@ray.remote
class BedMaker:
    def make_sleeper(self):
        p = multiprocessing.Process(target=sleep_forever)
        p.start()
        return p.pid
    def my_pid(self):
        return os.getpid()

def demo1():
    print("----------- demo1: ray.kill can kill subprocesses. ----------")
    b = BedMaker.remote()
    pid = ray.get(b.make_sleeper.remote())

    # ray.kill can kill subprocesses.
    pprint(get_process_info(pid)) # shows the process
    ray.kill(b)
    time.sleep(1)
    pprint(get_process_info(pid)) # shows not found

def demo2():
    print("----------- demo2: sigkill'd actor can't kill subprocesses. ----------")
    # sigkill'd actor can't kill subprocesses
    b = BedMaker.remote()
    pid = ray.get(b.make_sleeper.remote())
    actor_pid = ray.get(b.my_pid.remote())

    pprint(get_process_info(pid)) # shows the process
    psutil.Process(actor_pid).kill() # sigkill
    time.sleep(1)
    pprint(get_process_info(pid)) # shows the process

demo1()
demo2()

result:

----------- demo1: ray.kill can kill subprocesses. ----------
{'CPU Times': pcputimes(user=0.0, system=0.0, children_user=0.0, children_system=0.0, iowait=0.0),
 'Memory Info': pmem(rss=44326912, vms=42548969472, shared=2883584, text=2043904, lib=0, data=509145088, dirty=0),
 'Name': 'ray::BedMaker.make_sleeper',
 'PID': 7576,
 'Status': 'sleeping'}
'No process found with PID: 7576'
----------- demo2: sigkill'd actor can't kill subprocesses. ----------
{'CPU Times': pcputimes(user=0.0, system=0.0, children_user=0.0, children_system=0.0, iowait=0.0),
 'Memory Info': pmem(rss=44318720, vms=42548936704, shared=2883584, text=2043904, lib=0, data=509112320, dirty=0),
 'Name': 'ray::BedMaker.make_sleeper',
 'PID': 7685,
 'Status': 'sleeping'}
{'CPU Times': pcputimes(user=0.0, system=0.0, children_user=0.0, children_system=0.0, iowait=0.0),
 'Memory Info': pmem(rss=44318720, vms=42548936704, shared=2883584, text=2043904, lib=0, data=509112320, dirty=0),
 'Name': 'ray::BedMaker.make_sleeper',
 'PID': 7685,
 'Status': 'sleeping'}

Issue Severity

Medium: It is a significant difficulty but I can work around it.

rynewang commented 8 months ago

One solution on Linux that I can think of is to mark the raylet as a child subreaper with prctl(PR_SET_CHILD_SUBREAPER) (see https://man7.org/linux/man-pages/man2/prctl.2.html). This way, if a descendant process (e.g. core_worker) dies, all of its orphaned child processes are reparented to the raylet. We then need to handle SIGCHLD:

  1. Receive SIGCHLD in the raylet, either:
    1. use signalfd, and poll it via the raylet's asio loop (or, if this is too complex, use a dedicated thread)
    2. or use sigaction
  2. When SIGCHLD arrives:
    1. call waitpid to reap the exited children
    2. list all of the raylet's child processes; if there are unrecognized ones, kill them

Problem: this is not portable. macOS does not have PR_SET_CHILD_SUBREAPER; we may investigate kqueue, but maybe we don't care either (lol). Windows has JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, but we may care even less (lol).
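
The Linux-only approach above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the raylet implementation (the raylet is C++): it calls `prctl` through `ctypes`, finds children by scanning `/proc`, and uses a plain `signal.signal` handler rather than signalfd. All function names (`become_subreaper`, `child_pids`, `reap_exited_children`, `kill_unknown_children`, `install_sigchld_handler`) are hypothetical.

```python
import ctypes
import os
import signal

# Value of PR_SET_CHILD_SUBREAPER in <linux/prctl.h> (Linux 3.4+).
PR_SET_CHILD_SUBREAPER = 36


def become_subreaper():
    """Mark the calling process as a child subreaper (Linux only)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.prctl.restype = ctypes.c_int
    libc.prctl.argtypes = [ctypes.c_int] + [ctypes.c_ulong] * 4
    if libc.prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def child_pids(parent_pid):
    """Return the PIDs whose parent is parent_pid, by scanning /proc."""
    children = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The command name is parenthesized and may contain spaces, so
        # split after the last ')'. The next fields are state and ppid.
        fields = stat.rsplit(")", 1)[1].split()
        if int(fields[1]) == parent_pid:
            children.append(int(entry))
    return children


def reap_exited_children():
    """Reap every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _ = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def kill_unknown_children(known_pids):
    """SIGKILL every child of this process that is not in known_pids."""
    killed = []
    for pid in child_pids(os.getpid()):
        if pid in known_pids:
            continue
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            continue
        killed.append(pid)
    return killed


def install_sigchld_handler(known_pids):
    """On SIGCHLD, reap exited children, then kill unrecognized ones."""
    def on_sigchld(signum, frame):
        reap_exited_children()
        kill_unknown_children(known_pids)

    signal.signal(signal.SIGCHLD, on_sigchld)
```

In this sketch, `known_pids` stands in for the set of worker PIDs the raylet started itself. Note that reaping with `waitpid(-1, ...)` takes exit statuses away from any other code in the same process that waits on its own children, so a real implementation would need to coordinate with the existing child-process bookkeeping.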

@rkooo567 wdyt

rkooo567 commented 8 months ago

@rynewang I think that's exactly the idea Cade brought up, iirc: https://anyscaleteam.slack.com/archives/G015EEPTEMN/p1699466934904329?thread_ts=1699445284.501769&cid=G015EEPTEMN

I think we can handle this only on Linux, as an advanced feature, if the other options are complicated. Btw, this is a duplicate of https://github.com/ray-project/ray/issues/26118

rkooo567 commented 8 months ago

Duplicate: https://github.com/ray-project/ray/issues/26118

rkooo567 commented 8 months ago

Let's use the tests here as the unit tests for your PR @rynewang