ros2 / launch

Tools for launching multiple processes and for writing tests involving multiple processes.
Apache License 2.0

Kill dangling subprocesses #632

Open ivanpauno opened 2 years ago

ivanpauno commented 2 years ago

Fixes https://github.com/ros2/launch/issues/545.

I used psutil to figure out the children of a process recursively. It's an easy way to handle this issue platform-independently.

For POSIX OSs, we could send a signal to the process group, but for that we would have to create a new process group when launching a process, and I'm not sure that's the best idea.
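Boiled down, the psutil-based approach is something like the following sketch (the function name and structure are illustrative, not the exact PR code):

```python
import signal

import psutil


def kill_process_tree(pid, sig=signal.SIGTERM):
    """Send ``sig`` to every descendant of ``pid``, then to ``pid`` itself."""
    try:
        parent = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return
    # children(recursive=True) walks the whole subtree, platform-independently.
    for child in parent.children(recursive=True):
        try:
            child.send_signal(sig)
        except psutil.NoSuchProcess:
            pass  # the child already exited; nothing to do
    try:
        parent.send_signal(sig)
    except psutil.NoSuchProcess:
        pass
```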

hidmic commented 2 years ago

@ivanpauno before commenting anything, why explicitly list all processes instead of using process groups (and whatever Windows has as an equivalent)?

ivanpauno commented 2 years ago

@ivanpauno before commenting anything, why explicitly list all processes instead of using process groups (and whatever Windows has as an equivalent)?

IIUC there's no equivalent to process groups in Windows. We could use them in POSIX-like OSs though.

(edit) I have to double check this though

ivanpauno commented 2 years ago

@ivanpauno before commenting anything, why explicitly list all processes instead of using process groups (and whatever Windows has as an equivalent)?

I have a few reasons:

Maybe we shouldn't have this feature at all, and only log a warning if a "dangling" subprocess is detected.

ivanpauno commented 2 years ago

Where it uses psutil.wait_procs?

It would be nice to notify the user in the launch output if the child process still did not exit after some period of time.

I don't think we can block when executing an action, so I would need another timer to check if the processes were killed. Anyway, I'm sending SIGKILL, which cannot be ignored.

Furthermore, does this timer always wait until it expires and then check the subprocesses, or does it only do this if the parent process has to be sent SIGTERM/SIGKILL?

It always sends the signals to all subprocesses, even if SIGTERM/SIGKILL was not needed.

Also, what happens when all the subprocess exit cleanly and quickly? Does this timer still wait and then check them?

Yes, and "process does not exist" errors are ignored when trying to kill the subprocesses that were previously detected in the tree for this reason. I'm counting with the OS not reassinging the same pid shortly after.

It feels like we're missing something here where we await these processes exiting and only send them signals if they don't exit.

Do you mean we should kill subprocesses going "level by level"? i.e. first kill the process launched with ExecuteLocal; after waiting for that process to be killed, kill its subprocesses that are still running. Is that your idea? Should this be repeated recursively or only for one level?

wjwwood commented 2 years ago

I don't think we can block when executing an action, so I would need another timer to check if the processes were killed.

We can await the processes to exit, either with a thread that sets a future, or if possible using existing asyncio subprocess stuff.
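For example, something along these lines (just a sketch; `wait_for_exit` is a made-up helper, not an existing launch API):

```python
import asyncio

import psutil


async def wait_for_exit(pid, timeout=None):
    """Return True if the process exited within ``timeout``, False otherwise."""
    try:
        proc = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return True  # already gone
    loop = asyncio.get_running_loop()
    try:
        # psutil.Process.wait() blocks, so run it on the default thread pool
        # and await the result without blocking the launch event loop.
        await loop.run_in_executor(None, proc.wait, timeout)
        return True
    except psutil.TimeoutExpired:
        return False
```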

Anyway, I'm sending SIGKILL, which cannot be ignored.

It cannot be caught, but that doesn't mean the process will definitely exit, or say how long that will take. I've definitely had some processes fail to exit on kill -9 before, though usually the system is borked then. But the point would be to notify the user (so they don't have to go crawling through ps to see what stayed around), not to do anything else, since there's nothing else we could do after SIGKILL.

It always sends the signals to all subprocesses, even if SIGTERM/SIGKILL was not needed.

Yes, and "process does not exist" errors are ignored when trying to kill the subprocesses that were previously detected in the tree for this reason.

My point was more about the waiting. If the default timeout of sigkill_subprocesses_timeout is like 5 seconds, but all the processes exit immediately, do we wait 5 seconds, send a bunch of signals we don't need to send, then finally exit?

I'm counting on the OS not reassigning the same pid shortly after.

Is that a safe assumption?

Do you mean we should kill subprocesses going "level by level"? i.e. first kill the process launched with ExecuteLocal; after waiting for that process to be killed, kill its subprocesses that are still running. Is that your idea?

No, that's not what I'm suggesting.

Instead I was thinking something like this:

This process is good in my opinion because it:

Obviously the escalation process for the main process and the child processes is the same, so maybe we can generalize that, so it can be used in ExecuteLocal as well as with any individual process we have a pid for.

What I outlined is a bit more work to implement, but I also think it's a lot more thorough and is likely to save us (and our users) some time in the future debugging these kinds of issues.

wjwwood commented 2 years ago

Do you mean we should kill subprocesses going "level by level"?

Another comment about this.

I wasn't thinking that; I was still thinking of doing it as a batch operation, but I also don't think doing it layer by layer is a bad idea, though it might be really annoying if it takes a long time.

The reason I like it as an idea is that I want to give the called processes every chance to behave, and doing a shutdown escalation level by level (top to bottom) is the best way to do that.

I could see this as something of a configuration. If we decide to implement what I outlined above, then making it possible to do it step by step instead of in batch should be easy-ish to do. We just need to keep that in mind when working on it.

ivanpauno commented 2 years ago

Is that a safe assumption?

I'm not sure. If that's not a safe assumption, then we cannot do anything. The best we can do is to notify the user "it looks like a launched process left some subprocesses alive after being killed: {pids}".

starts tracking the child process statuses before trying to terminate them, so (maybe) you don't have to worry about the OS reassigning them

But you cannot really do this:

ivanpauno commented 2 years ago

The alternative is to create a new group id when launching a process, and then send a signal to the created group. The problem is that we have to handle the POSIX and Windows cases separately, but maybe we can find something that works for both.
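On the POSIX side, that would look roughly like this (`sleep` stands in for the launched process; Windows would need a separate code path, which is the drawback):

```python
import os
import signal
import subprocess

# Start the child as the leader of a new session (and therefore a new process
# group), so a single signal can reach the whole tree it spawns.
proc = subprocess.Popen(['sleep', '1000'], start_new_session=True)

# Later, on shutdown, signal the whole group instead of just the direct child.
try:
    os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
except ProcessLookupError:
    pass  # the group is already gone
```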

wjwwood commented 2 years ago

OSs don't provide a way to do this AFAIK. It's possible for a process you launched directly, but it's not possible for recursive subprocesses (I think waitpid is limited to direct children on Linux).

Then how does https://psutil.readthedocs.io/en/latest/index.html?highlight=children#psutil.wait_procs work? In their example they enumerate the children and then wait for them to exit.

ivanpauno commented 2 years ago

Then how does https://psutil.readthedocs.io/en/latest/index.html?highlight=children#psutil.wait_procs work? In their example they enumerate the children and then wait for them to exit.

polling https://github.com/giampaolo/psutil/blob/379598312f60bf414afa8bf549f7f26af9e578ea/psutil/__init__.py#L1532-L1550 https://github.com/giampaolo/psutil/blob/57ed46de8a988e7ab26279c2a967fb15b05397a3/psutil/_psposix.py#L120-L125

wjwwood commented 2 years ago

Ok, I see, so no actual blocking is realistic, and it does it one at a time, so you may "miss" the exit of one process while waiting on another.

So maybe that doesn't help with the race between process exit and pid reuse, but I still think waiting to see if they exit is a decent idea. There's nothing worse than having to go back to ps or something to figure out what was left behind, especially if launch could just tell us.

And my proposed series of steps has some other advantages, even if this one doesn't pan out.

wjwwood commented 2 years ago

The alternative is to create a new group id when launching a process, and then send a signal to the created group.

What benefit does that give us? (curious)

ivanpauno commented 2 years ago

Ok, I see, so no actual blocking is realistic, and it does it one at a time, so you may "miss" the exit of one process while waiting on another.

Just in case, polling is not only used to check the status of more than one process, it's also used to "wait for a process to exit" if the process is not a child. Waiting for a process that's not a child is just a loop with a "sleep" between tries, checking if a process with pid==x exists (see here).
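Roughly, that loop amounts to this on POSIX (a simplified sketch; psutil's real implementation handles more corner cases):

```python
import os
import time


def pid_exists(pid):
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check only, no signal sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # it exists, we just can't signal it
    return True


def wait_for_non_child(pid, timeout, interval=0.04):
    """Return True if the process with ``pid`` disappeared within ``timeout``."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not pid_exists(pid):
            return True
        time.sleep(interval)
    return not pid_exists(pid)
```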

but I still think waiting to see if they exit is a decent idea

Sounds good. But given the API we have available, I think it's a good idea to use a launch timer for that. Is that okay? If another way is preferred, please specify which.

And my proposed series of steps has some other advantages, even if this one doesn't pan out.

Do you mean escalating SIGINT -> SIGTERM -> SIGKILL -> "Log if subprocess(es) are still alive" for subprocesses as well? That sounds fine to me, I can add that.
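Something like this per pid, I guess (a POSIX-only sketch with blocking waits and made-up timeouts, not the final code):

```python
import signal

import psutil


def escalate(pid, timeouts=(5.0, 5.0, 2.0)):
    """Escalate SIGINT -> SIGTERM -> SIGKILL, then log if the pid survives."""
    try:
        proc = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return
    for sig, timeout in zip((signal.SIGINT, signal.SIGTERM, signal.SIGKILL), timeouts):
        try:
            proc.send_signal(sig)
            proc.wait(timeout)  # blocking here for brevity; launch would use timers
            return              # the process exited before the next escalation step
        except psutil.NoSuchProcess:
            return
        except psutil.TimeoutExpired:
            continue
    print(f'process {pid} is still alive even after SIGKILL')
```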

What benefit does that give us? (curious)

You only send one signal to the group, though the part that monitors whether the processes exited or not doesn't change. So it doesn't seem all that beneficial, honestly.

ivanpauno commented 2 years ago

And my proposed series of steps has some other advantages, even if this one doesn't pan out.

Do you mean escalating SIGINT -> SIGTERM -> SIGKILL -> "Log if subprocess(es) are still alive" for subprocesses as well? That sounds fine to me, I can add that.

@wjwwood could you confirm this?

ivanpauno commented 2 years ago

@wjwwood friendly ping

wjwwood commented 2 years ago

But given the API we have available, I think it's a good idea to use a launch timer for that. Is that okay? If another way is preferred, please specify which.

So, I was actually thinking we do something like what psutil does (perhaps using it), but in a thread or maybe as an async coroutine/task. Basically: create a list of the pids you're watching and start timers to escalate their signals; then a thread/coroutine-task iterates over all the pids we're watching, cancels the timers and removes from the list each pid that has exited, then sleeps for a fixed short period and polls again until the list is empty. If some period has passed after SIGKILL and a pid is still there, log it and then remove it from the list of pids to watch.
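In rough code, something like this (names and data layout are hypothetical; the escalation timers themselves would live elsewhere):

```python
import asyncio
import time

import psutil


async def watch_pids(watched, poll_period=0.1):
    """``watched`` maps pid -> (escalation_timers, sigkill_deadline_or_None)."""
    while watched:
        for pid in list(watched):
            timers, kill_deadline = watched[pid]
            if not psutil.pid_exists(pid):
                for timer in timers:
                    timer.cancel()  # the process exited: stop escalating signals
                del watched[pid]
            elif kill_deadline is not None and time.monotonic() > kill_deadline:
                print(f'process {pid} did not exit after SIGKILL')
                del watched[pid]  # nothing more we can do, stop watching it
        await asyncio.sleep(poll_period)
```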

Do you mean escalating SIGINT -> SIGTERM -> SIGKILL -> "Log if subprocess(es) are still alive" for subprocesses as well?

Correct.

ivanpauno commented 2 years ago

So, I was actually thinking we do something like what psutil does (perhaps using it), but in a thread or maybe as an async coroutine/task. Basically: create a list of the pids you're watching and start timers to escalate their signals; then a thread/coroutine-task iterates over all the pids we're watching, cancels the timers and removes from the list each pid that has exited, then sleeps for a fixed short period and polls again until the list is empty. If some period has passed after SIGKILL and a pid is still there, log it and then remove it from the list of pids to watch.

Please take a look at https://github.com/ros2/launch/pull/632/commits/e326c0dfee10ad1beaa7c12db6d97eb15da57b74

ciandonovan commented 2 years ago

Has this not been fixed by https://github.com/ros2/launch/pull/475? When running a ROS2 Launch in "non-interactive" mode, SIGINTs are successfully propagated to child nodes, cleanly terminating the process tree without needing Ctrl-C in a controlling terminal. SIGTERMs still aren't handled at all though, as I mentioned in https://github.com/ros2/launch/issues/666

ivanpauno commented 2 years ago

Has this not been fixed by https://github.com/ros2/launch/pull/475? When running a ROS2 Launch in "non-interactive" mode, SIGINTs are successfully propagated to child nodes

This is not about killing subprocesses of launch, but about killing subprocesses of the processes that launch spawns. Ideally, this would be the responsibility of the launched subprocess in question, and not of launch. But there are easy ways to run into the problem (e.g. shell=True), which is why this PR exists. See the discussion in https://github.com/ros2/launch/issues/545.

I'm not sure if what we're doing here should be enabled by default or not, but having it as an option is really nice to have.
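Just to illustrate, the shell=True case can be reproduced in a couple of lines (a sketch using a POSIX shell; exact behavior depends on the shell):

```python
import subprocess

# With shell=True, proc.pid is the shell's pid, not the program's.
proc = subprocess.Popen('sleep 1000; echo done', shell=True)

proc.terminate()  # terminates the shell...
# ...but (depending on the shell) `sleep 1000` can keep running as an orphaned
# grandchild, which is exactly the dangling subprocess this PR tries to clean up.
```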

ivanpauno commented 2 years ago

@wjwwood friendly ping

ivanpauno commented 1 year ago

I don't know if it is feasible or worth the effort to create tests for this behavior, but it would be good, I think, to show it is working and guard against regressions.

Working on adding tests now

ivanpauno commented 1 year ago

@wjwwood a test case was added in https://github.com/ros2/launch/pull/632/commits/5b43cdcc50108a60fbbb4b56d3eb9d88b91c52f7, and I also needed to rebase to solve conflicts with rolling.

methylDragon commented 1 year ago

@ivanpauno just some flake8 linting issues

ivanpauno commented 1 year ago

mmm, I have to double check those failures, they seem related

ivanpauno commented 1 year ago

This is making Windows CI hang for some reason... I will try to set up a Windows VM and see what's going on, but I'm not sure I will have the time to complete that.

russkel commented 1 year ago

My use case is a process that spawns some children and then exits, and I do not want launch to exit until the child processes have exited.

I have implemented this in https://github.com/Greenroom-Robotics/launch_ext/commit/850a2f4b59e7bb49857f8c5fec06112aa8e239cc#diff-7baf6e854cc3c937eaed0b127161c5f82cf86d8a8eaef0038171216a817a0c62R194-R222

My technique was to use the stdin/stdout/stderr inodes: if any process shared the same inodes with the parent process, it would be considered a child to be waited on. I am not sure this approach is ideal, but it does seem to work.
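Boiled down, the heuristic is something like this Linux-only sketch (not the exact launch_ext code linked above; function names are simplified):

```python
import os

import psutil


def stdio_inodes(pid):
    """Inodes of the process's stdin/stdout/stderr targets (via /proc)."""
    inodes = set()
    for fd in (0, 1, 2):
        try:
            inodes.add(os.stat(f'/proc/{pid}/fd/{fd}').st_ino)
        except OSError:
            pass  # fd closed, process gone, or no permission
    return inodes


def likely_children(parent_pid):
    """Processes sharing any stdio inode with the parent."""
    parent_inodes = stdio_inodes(parent_pid)
    return [p for p in psutil.process_iter()
            if p.pid != parent_pid and stdio_inodes(p.pid) & parent_inodes]
```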

mwcondino commented 4 months ago

Howdy @ivanpauno - any chance this could still be merged? I've encountered this issue recently, and have a workaround in place, but it would be ideal to handle the issue directly in ExecuteLocal with this fix :smile: