Closed: rynewang closed this 8 months ago
One Linux-only solution I can think of is to mark raylet with PR_SET_CHILD_SUBREAPER (see https://man7.org/linux/man-pages/man2/prctl.2.html). That way, if a descendant process (e.g. core_worker) dies, all of its orphaned child subprocesses are reparented to raylet. Raylet would then also need to handle SIGCHLD.
Problem: this is not portable. macOS does not have PR_SET_CHILD_SUBREAPER; we could investigate kqueue, but maybe we don't care either (lol). Windows has JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, but we may care even less (lol).
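A minimal Linux-only sketch of the subreaper idea, in plain Python rather than raylet's actual code. Everything here is an assumption for illustration: the helper names (`become_subreaper`, `parent_pid_of`, `demo`) are hypothetical, the "worker" is just a forked process, and the constant `36` is the value of `PR_SET_CHILD_SUBREAPER` from `<linux/prctl.h>`. For simplicity it waits synchronously instead of installing a SIGCHLD handler; a long-running process would presumably reap from a SIGCHLD handler or a polling loop instead.

```python
import ctypes
import os
import signal
import time

PR_SET_CHILD_SUBREAPER = 36  # from <linux/prctl.h>


def become_subreaper() -> None:
    """Ask the kernel to reparent orphaned descendants to this process."""
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    if libc.prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1), zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def parent_pid_of(pid: int) -> int:
    """Read the parent PID of `pid` from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The fields after the closing ')' of the comm field are: state, ppid, ...
    return int(stat.rsplit(")", 1)[1].split()[1])


def demo() -> bool:
    """Return True if an orphaned grandchild gets reparented to this process."""
    become_subreaper()
    read_fd, write_fd = os.pipe()

    worker = os.fork()
    if worker == 0:  # stand-in for a worker process
        os.close(read_fd)
        grandchild = os.fork()
        if grandchild == 0:  # stand-in for a subprocess spawned by the worker
            os.close(write_fd)
            time.sleep(60)
            os._exit(0)
        os.write(write_fd, str(grandchild).encode())
        os.close(write_fd)
        time.sleep(60)
        os._exit(0)

    os.close(write_fd)
    grandchild = int(os.read(read_fd, 32))
    os.close(read_fd)

    # Kill the worker without giving it any chance to clean up.
    os.kill(worker, signal.SIGKILL)
    os.waitpid(worker, 0)

    # The orphan should now be our child, so we can kill and reap it.
    adopted = parent_pid_of(grandchild) == os.getpid()
    os.kill(grandchild, signal.SIGKILL)
    os.waitpid(grandchild, 0)
    return adopted
```

Calling `demo()` on Linux should show the grandchild being adopted by the calling process after the worker is SIGKILL'd, at which point the caller can kill and reap it.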
@rkooo567 wdyt
@rynewang I think that's exactly the idea Cade brought up iirc https://anyscaleteam.slack.com/archives/G015EEPTEMN/p1699466934904329?thread_ts=1699445284.501769&cid=G015EEPTEMN
I think we can handle this only on Linux, as an advanced feature, if the other options are complicated. Btw, this is a duplicate of https://github.com/ray-project/ray/issues/26118
Let's use the tests here as unit tests for your PR @rynewang
What happened + What you expected to happen
If a Ray worker process spawns a subprocess and the worker then dies, the worker tries to kill that subprocess as part of its graceful exit. However, if the worker is SIGKILL'd, it has no chance to clean up and the subprocess leaks.
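To illustrate the difference between the two exit paths, here is a rough sketch in plain Python for a POSIX system. It does not use Ray and is not the reproduction script for this issue. The function name `subprocess_outlives_worker` is hypothetical, the "worker" is a forked process, and its SIGTERM handler stands in for a graceful-exit cleanup path.

```python
import os
import signal
import time


def subprocess_outlives_worker(sig: int) -> bool:
    """Kill a stand-in worker with `sig`; report whether its subprocess survives."""
    read_fd, write_fd = os.pipe()

    worker = os.fork()
    if worker == 0:  # stand-in for a worker process
        os.close(read_fd)
        child = os.fork()
        if child == 0:  # stand-in for a subprocess spawned by the worker
            os.close(write_fd)
            time.sleep(60)
            os._exit(0)

        def cleanup(signum, frame):
            # Graceful exit path: kill and reap the subprocess, then exit.
            os.kill(child, signal.SIGKILL)
            os.waitpid(child, 0)
            os._exit(0)

        # Install the handler before telling the parent we are ready.
        signal.signal(signal.SIGTERM, cleanup)
        os.write(write_fd, str(child).encode())
        os.close(write_fd)
        time.sleep(60)
        os._exit(0)

    os.close(write_fd)
    child = int(os.read(read_fd, 32))
    os.close(read_fd)

    os.kill(worker, sig)
    os.waitpid(worker, 0)

    try:
        os.kill(child, 0)  # signal 0 only checks that the process exists
        alive = True
    except ProcessLookupError:
        alive = False

    if alive:
        # Do not leave the leaked process behind after the demonstration.
        os.kill(child, signal.SIGKILL)
        try:
            os.waitpid(child, 0)
        except ChildProcessError:
            pass  # adopted by another process (e.g. init), which reaps it
    return alive
```

With SIGTERM the handler runs and the subprocess is gone by the time the worker exits. With SIGKILL no handler can run, so the subprocess is still alive afterwards.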
Versions / Dependencies
master
Reproduction script
result:
Issue Severity
Medium: It is a significant difficulty but I can work around it.