jupyter / jupyter_client

Jupyter protocol client APIs
https://jupyter-client.readthedocs.io
BSD 3-Clause "New" or "Revised" License

subprocesses of kernels are not killed by restart #104

Closed jankatins closed 2 years ago

jankatins commented 8 years ago

Restarting a kernel only kills the direct child, but not any opened subprocesses of that child.

This is a problem if the kernel spawns multiple child processes, like the R kernel does (on Windows: R - cmd - R - cmd - Rterm), or if the Python kernel opens a subprocess which sleeps.

Issue on the irkernel repo: https://github.com/IRkernel/IRkernel/issues/226

IMO the kernel manager should kill the complete process group and not only the process which is directly spawned by the kernel manager.
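
On Unix, the process-group approach could look something like the following minimal sketch (not jupyter_client code; the spawned command just stands in for a kernel that launches its own subprocess):

```python
import os
import signal
import subprocess
import sys
import time

# Start the "kernel" in its own process group (start_new_session=True
# calls setsid() in the child), so a restart can signal the whole
# group instead of only the direct child.
kernel = subprocess.Popen(
    [sys.executable, "-c",
     "import subprocess, time; "
     "subprocess.Popen(['sleep', '500']); time.sleep(500)"],
    start_new_session=True,
)
time.sleep(1)  # give the fake kernel time to spawn its child

# Kill the entire process group: the kernel AND its children.
os.killpg(os.getpgid(kernel.pid), signal.SIGKILL)
kernel.wait()
```

With a plain kernel.kill() instead of os.killpg(), the inner sleep process would survive and be reparented to init, which is exactly the behavior reported here.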

I use "restart kernel" when I want back the resources I lost in the kernel (due to the stupid commands I executed), and currently I have 4 processes running (cmd - R - cmd - Rterm) whose kernel I killed 15 min ago (and they're still at 25% CPU and 1 GB RAM). So if I have to wait for the actual process which does the work to finish, the restart command is kind of useless :-(

Reproducible example (at least on Windows): put it into a cell, execute it, and then restart the kernel via the notebook UI.

import platform
import subprocess
if platform.system() == 'Windows':
    CREATE_NO_WINDOW = 0x08000000
    subprocess_creation_flags = CREATE_NO_WINDOW
else:
    # Apparently None won't work here
    subprocess_creation_flags = 0
exec_line = u"""python -c "import time;time.sleep( 50000 )" """
p = subprocess.Popen(exec_line, shell=False,
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                     creationflags=subprocess_creation_flags)
p.communicate()
takluyver commented 8 years ago

I suspect that makes sense on Unix, but killpg is only available on Unix, so that's not going to help you on Windows.

The question you linked has a couple of answers that would apply on Windows: one with psutil, and one that involves launching TASKKILL in another subprocess.

jankatins commented 8 years ago

ipyparallel has some better way to handle signals on windows: https://github.com/ipython/ipyparallel/blob/d35b4901adde6e651bb6ffc5c8aa9df0d20f1a73/ipyparallel/apps/launcher.py#L282-L288

Would you take a patch which uses check_output(['taskkill', '-pid', str(self.kernel.pid), '-t', '-f']) on Windows instead of self.kernel.kill() in _kill_kernel() (source)?
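
Such a patch might look roughly like this sketch (the helper name kill_kernel_tree is hypothetical; taskkill's -t flag walks the whole child tree and -f forces termination):

```python
import platform
import subprocess

def kill_kernel_tree(pid):
    """Hypothetical sketch: forcefully kill a kernel and its child tree."""
    if platform.system() == "Windows":
        # -t kills the process tree, -f forces termination.
        subprocess.check_output(
            ["taskkill", "-pid", str(pid), "-t", "-f"]
        )
    else:
        import os
        import signal
        # Non-Windows fallback: the existing direct kill.
        os.kill(pid, signal.SIGKILL)
```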

takluyver commented 8 years ago

In principle I'd be OK with that. Though a note on this page is a bit concerning: "Home editions of Windows do not have TASKKILL, use TSKILL instead."

Maybe that's outdated? Do we have a good way of checking?

jankatins commented 8 years ago

Not on this side: only pro systems here :-/

jankatins commented 8 years ago

BTW: can someone on Linux check whether the problem exists there too? E.g. run the cell in the initial message, restart the kernel, and see if the subprocess is also gone...?

takluyver commented 8 years ago

Yes, with a bit of adapting (turning command string into command list), I can run that (on Linux), and when I restart the kernel, the subprocess is orphaned and continues running.

However, thinking about this some more, I'm not sure that the proposed fixes will help. When we restart a kernel, we first ask it nicely to shut down, and only if that doesn't work do we forcibly kill it. So I guess it should be the kernel's own responsibility to deal with child processes in the first instance.

jankatins commented 8 years ago

OK, a bit of googling turned up that taskkill is not available in Win7 Home. I've not found anything about Win8 and Win10. So the workaround is: try taskkill, check if the command was found, and if not fall back to kernel.kill(), the old subprocess way...

RE "subprocess cleanup": that's true as long as the kernel can answer, but sometimes that's not the case: e.g. if the user starts an HTTP server for displaying a plot and the plotting itself takes so long that the kernel is unreachable and can't answer the "shutdown kernel" message. Or the IRkernel call-chain mess, where the kernel isn't responding as long as it's executing something...

Another way might be using psutil: http://stackoverflow.com/questions/4789837/how-to-terminate-a-python-subprocess-launched-with-shell-true/25134985#25134985 https://github.com/giampaolo/psutil/blob/1b71ca531ba377e8c71d2c581debfce7eae35a40/psutil/tests/__init__.py#L326

-> If psutil is importable, use that method to kill all children, otherwise fall back to the old one?

jankatins commented 8 years ago

The "kill all children" approach might also become a necessity if the future kernel nanny becomes unresponsive itself (not sure if this can ever happen...), e.g. how would you kill the nanny - kernel chain in that case?

takluyver commented 8 years ago

Part of the idea of the kernel nanny is that it should always be responsive, because it's not running user code, or any significant tasks. So the frontend would ask the nanny to shut down, and the nanny would deal with asking the kernel nicely and then forcefully killing it if necessary.

However, since the normal case is a kernel that is responsive, I think the first thing is to work out how kernels should shut themselves down and terminate child processes, before we worry too much about what to do when they're not responsive.

jankatins commented 8 years ago

> I think the first thing is to work out how kernels should shut themselves down and terminate child processes, before we worry too much about what to do when they're not responsive.

Is that for this issue or the kernel nanny proposal? If the first: that's over my head :-/

-> Should I still submit a PR for the taskkill (or even the psutil) way to kill the kernel and all its children?

takluyver commented 8 years ago

That's for this issue. I don't think the kernel nanny thing changes this in any significant way.

I'd rather leave this in a consistent state, i.e. as it is now, than introduce a bunch of extra complexity that fixes it in some situations but not others. Especially if we don't know how to fix it in the most common case, where the kernel is responding.

jankatins commented 8 years ago

So, here is my argument for having such a "kernel massacre" function included in both the nanny and the jupyter_client code. I think there are at least two use cases here for such a function, and IMO both are a nice feature of the kernel nanny (and of the code in jupyter_client, which would play the same role in case there is no nanny): cleaning up after the user:

  1. receive the shutdown request from the server (or being the server)
  2. register the children of the kernel process
  3. ask the kernel to shutdown itself
  4. wait a few seconds
  5. cleanup: kill the kernel and any children which are still around
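
A Unix-only sketch of that sequence, assuming the kernel was started with start_new_session=True so its pid identifies the process group, and with request_shutdown standing in for sending the shutdown_request message (both names are hypothetical):

```python
import os
import signal
import time

def shutdown_with_cleanup(kernel, request_shutdown, grace_seconds=5):
    # 1./2. note the group that holds the kernel and its children
    pgid = os.getpgid(kernel.pid)
    # 3. ask the kernel to shut itself down
    request_shutdown()
    # 4. wait a few seconds for a graceful exit
    deadline = time.time() + grace_seconds
    while time.time() < deadline and kernel.poll() is None:
        time.sleep(0.1)
    # 5. kill the kernel and any children which are still around
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the whole group already exited on its own
```
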
takluyver commented 8 years ago

psutil can find all children of a process, though killpg seems like a better way to do it on Unix. Maybe we just need to depend on psutil on Windows.
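
The psutil half of that split could look like this sketch (psutil is third-party, which is why it would become a dependency; the helper name is hypothetical):

```python
import psutil  # third-party: pip install psutil

def kill_tree_with_psutil(pid):
    """Hypothetical helper: kill a process's whole descendant tree,
    children first, then the process itself."""
    try:
        parent = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return
    for child in parent.children(recursive=True):
        try:
            child.kill()
        except psutil.NoSuchProcess:
            pass  # already gone
    try:
        parent.kill()
    except psutil.NoSuchProcess:
        pass
```

Unlike killpg, this works even when the children were not placed in the kernel's process group, which is why it is attractive on Windows where process groups behave differently.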

mlucool commented 2 years ago

Hi,

Adding another simple example of this issue:

Run the following code, then restart your notebook. Then run it again and you'll see the children piling up (the demo assumes you have no other sleep processes running).

import getpass, os, psutil, signal, subprocess

def find_interesting_children():
    user = getpass.getuser()
    ret = []
    for pid in list(psutil.pids()):
        try:
            proc = psutil.Process(pid)
        except Exception:
            # Ignore now-dead processes.
            continue
        if proc.username() != user:
            continue
        if proc.name() != 'sleep':
            continue
        ret.append(pid)
    return ret

def preexec_function():
    signal.signal(signal.SIGINT, signal.SIG_IGN)

print(find_interesting_children())
child = subprocess.Popen(['sleep', '180d'], preexec_fn=preexec_function)
print(find_interesting_children())

If you want to cleanup run later:

for pid in find_interesting_children():
    os.kill(pid, signal.SIGTERM)

On a high level, this stems from the choice in this toy example to ignore SIGINT. Sometimes a subprocess does this and it's out of your control, so restarting kernels leaves behind a mess (if you don't ignore SIGINT, things work as expected). Can we add logic to handle this variant of process cleanup?

kevin-bates commented 2 years ago

Hi @mlucool, Thanks for the test case!

I've spent the better part of today looking into this and it's quite interesting (IMHO). There are three areas of interest, two of which are still a bit of a mystery to me.

  1. First, I think the LocalProvisioner has a regression in that it issues Popen kill() and terminate() calls directly, rather than preferring to send the respective signals (SIGKILL, SIGTERM) to the process group, as was the case in 6.x. This issue, though, does not manifest itself in the current incarnation of things (more below).
  2. It's a mystery to me why ignoring SIGINT has an impact when ignoring any other signal (or keeping the default SIGINT handler) does not!
  3. There are two forms of shutdown that can happen, "immediate" and "graceful". The default (via the now parameter) is "graceful", meaning that a shutdown_request message is sent and, unless there are issues, nothing else occurs (if the message is detected as not having worked [via polling], then SIGTERM/SIGKILL are sent). "Immediate" shutdown issues a kill() call directly. (The regression mentioned in item 1 is that this should use the signal against the process group, before issuing kill().)

What I don't understand is why handling SIGINT makes a difference here. Seems like it, coupled with the message-based shutdown, is an area that should be looked at (at the kernel level I presume). (Does the kernel's message-based shutdown try to send SIGINT to its process group as its form of cleaning up child processes by any chance? Just trying to explain what is seen.)

One thing we could do relative to restarts is perform shutdown in an immediate fashion (now=True). This will then trigger the signal/process-group form of shutdown (provided the LocalProvisioner is fixed).

This would still leave an issue when just terminating a kernel that has set up its child processes to ignore SIGINT since that uses "graceful" termination.

For now, I'd like to fix the regression issue in LocalProvisioner. Of course, this fix won't be "visible" until immediate shutdowns are more prevalent.

Should we make the shutdown issued from restart_kernel use now=True and bypass the message-based shutdown on restarts? Another approach might be to update the handler in jupyter_server to use now=True when restarting the kernel.

It would be great to hear from someone with some kernel-side experience and how "shutdown-request" with ignored SIGINT comes into play.

mlucool commented 2 years ago

CC @Carreau

I think what's happening is as follows (I am not familiar with the client or kernel code, so this is a guess):

  1. SIGINT is sent to the kernel and it relays it to its children
  2. The kernel is correctly shut down, but its children are not
  3. The children are reparented to PID 1, so jupyter can no longer tell that it should be trying to kill them

Proposal (maybe this should move to IPykernel/IPython?):

  1. On SIGINT the kernel does the normal SIGINT things but does not exit until all its children exit first.
  2. Because jupyter_client can send SIGTERM too, the kernel should also wait to exit on SIGTERM until all its children exit first.
  3. The kernel shutdown code should be robust to getting a SIGTERM while blocking on children as a result of a prior SIGINT.
  4. jupyter_client will do its usual escalation from SIGINT to SIGTERM to SIGKILL in a way that causes the whole process group to receive each signal.
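
Steps 1-3 of this proposal could be sketched kernel-side like this (Unix-only; the helper name and the killpg-based broadcast are assumptions for illustration, not existing kernel code):

```python
import os
import signal

def terminate_and_reap_children():
    # Ignore the signal we are about to broadcast, so the kernel
    # process itself survives it (step 3: robust to repeated signals
    # while blocking on children).
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    # Forward SIGTERM to the whole process group, which reaches the
    # children (steps 1-2).
    try:
        os.killpg(os.getpgrp(), signal.SIGTERM)
    except ProcessLookupError:
        pass
    # Block until every child has been reaped; only then may the
    # kernel exit ("does not exit until all its children exit first").
    while True:
        try:
            os.wait()
        except ChildProcessError:
            break  # no children left
```

A child that itself ignores SIGTERM would make this wait forever, so a real implementation would need to escalate to SIGKILL after a timeout, mirroring the escalation jupyter_client already performs for the kernel process.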
kevin-bates commented 2 years ago

> On SIGINT the kernel does the normal SIGINT things but does not exit until all its children exit first.

The SIGINT issued during shutdown_kernel() is merely an attempt to have any current processing interrupted prior to the process's termination (via "shutdown_request" and, possibly, SIGTERM/SIGKILL). It was (relatively) recently added and is literally the interrupt_kernel() method itself. It is not meant to terminate anything.

I think the parent process should be responsible for the termination of its immediate children (and so on). We have the ability to cheat when the kernel is a local process and send a signal to the process group but (per my previous comment) that only occurs in "fallback" situations when the message-based shutdown request has proven ineffective. (Note: this was one of the primary reasons the interrupt was introduced in the shutdown sequence, so the kernel could respond to "shutdown_request" messages.)

I would agree that when receiving a "shutdown_request" message, the kernel process should do whatever it can to terminate its children. This could probably be something like os.killpg(os.getpgrp(), SIGTERM) (and the equivalent windows code) as its last act.

mlucool commented 2 years ago

FTR, ipykernel 6.9.2 fixes https://github.com/jupyter/jupyter_client/issues/104#issuecomment-1028388087.

alita-moore commented 5 months ago

Is there a way to make jupyter NOT stop sub-processes?

kevin-bates commented 5 months ago

Hi @alita-moore. Life-cycle management of the kernel process (i.e., the sub-process) is the responsibility of the kernel provisioner. This gives one the ability to implement whatever kind of process management they'd like, beyond what the local provisioner provides by default.