jankatins closed this issue 2 years ago
I suspect that makes sense on Unix, but `killpg` is only available on Unix, so that's not going to help you on Windows.
The question you linked has a couple of answers that would apply on Windows: one with psutil, and one that involves launching `TASKKILL` in another subprocess.
ipyparallel has some better way to handle signals on windows: https://github.com/ipython/ipyparallel/blob/d35b4901adde6e651bb6ffc5c8aa9df0d20f1a73/ipyparallel/apps/launcher.py#L282-L288
Would you take a patch which uses `check_output(['taskkill', '-pid', str(self.kernel.pid), '-t', '-f'])` on Windows instead of `self.kernel.kill()` in `_kill_kernel()` (source)?
In principle I'd be OK with that. Though a note on this page is a bit concerning: "Home editions of Windows do not have TASKKILL, use TSKILL instead."
Maybe that's outdated? Do we have a good way of checking?
Not on this side: only pro systems here :-/
BTW: can someone on Linux check whether the problem exists there, too? E.g. run the cell in the initial message, restart the kernel, and see if the subprocess is also gone...?
Yes, with a bit of adapting (turning the command string into a command list), I can run that on Linux, and when I restart the kernel, the subprocess is orphaned and continues running.
However, thinking about this some more, I'm not sure that the proposed fixes will help. When we restart a kernel, we first ask it nicely to shut down, and only if that doesn't work do we forcibly kill it. So I guess it should be the kernel's own responsibility to deal with child processes in the first instance.
Ok, a bit of googling turned up that taskkill is not available in Win7 Home. I've not found out about Win8 and Win10. So the workaround is to try taskkill, check whether the command was found, and if not fall back to `kernel.kill()`, the old subprocess way...
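That fallback could look roughly like the sketch below. This is not jupyter_client's actual code: the helper name is hypothetical, and `/PID`, `/T`, `/F` are taskkill's documented flags for targeting a PID, killing the whole process tree, and forcing termination. On systems without taskkill it degrades to the old `Popen.kill()` behaviour, which only reaches the direct child.

```python
import shutil
import subprocess


def kill_kernel_tree(popen):
    """Hypothetical sketch: kill a kernel and its children on Windows,
    falling back to plain kill() where taskkill is unavailable."""
    if shutil.which("taskkill") is not None:
        # /T terminates the whole process tree, /F forces it.
        subprocess.check_output(
            ["taskkill", "/PID", str(popen.pid), "/T", "/F"])
    else:
        # Old behaviour: only the direct child is killed.
        popen.kill()
    popen.wait()
```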
RE "subprocess cleanup": that's true as long as the kernel can answer, but sometimes that's not the case: e.g. if the user starts a http server for displaying a plot and then the plotting itself takes too long, so that the kernel is unreachable and can't answer the "shutdown kernel" message. Or the irkernel call chain mess where the kernel isn't responding as long as it executes something...
Another way might be using psutil: http://stackoverflow.com/questions/4789837/how-to-terminate-a-python-subprocess-launched-with-shell-true/25134985#25134985 https://github.com/giampaolo/psutil/blob/1b71ca531ba377e8c71d2c581debfce7eae35a40/psutil/tests/__init__.py#L326
-> If psutil is importable, use that method to kill all children, otherwise the old one?
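A sketch of that "psutil if importable, otherwise the old way" idea (the function name is made up for illustration; it is not jupyter_client's API):

```python
import subprocess

try:
    import psutil
except ImportError:
    psutil = None


def kill_process_tree(popen):
    """Kill `popen` and, when psutil is available, all of its descendants."""
    if psutil is not None:
        try:
            parent = psutil.Process(popen.pid)
        except psutil.NoSuchProcess:
            return
        # Collect children before killing the parent: once the parent
        # dies, they get reparented and can no longer be found this way.
        for child in parent.children(recursive=True):
            try:
                child.kill()
            except psutil.NoSuchProcess:
                pass
        parent.kill()
    else:
        # Old behaviour: only the direct child is killed.
        popen.kill()
    popen.wait()
```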
The "kill all children" approach might also become a necessity if the future kernel nanny becomes unresponsive itself (not sure if this can ever happen...), e.g. how to kill the nanny - kernel chain in that case.
Part of the idea of the kernel nanny is that it should always be responsive, because it's not running user code, or any significant tasks. So the frontend would ask the nanny to shut down, and the nanny would deal with asking the kernel nicely and then forcefully killing it if necessary.
However, since the normal case is a kernel that is responsive, I think the first thing is to work out how kernels should shut themselves down and terminate child processes, before we worry too much about what to do when they're not responsive.
> I think the first thing is to work out how kernels should shut themselves down and terminate child processes, before we worry too much about what to do when they're not responsive.
Is that for this issue or the kernel nanny proposal? If the first: that's over my head :-/
-> Should I still submit a PR for the taskkill (or even the psutil) way to kill the kernel and all its children?
That's for this issue. I don't think the kernel nanny thing changes this in any significant way.
I'd rather leave this in a consistent state, i.e. as it is now, than introduce a bunch of extra complexity that fixes it in some situations but not others. Especially if we don't know how to fix it in the most common case, where the kernel is responding.
So, here is my argument for having such a "kernel massacre" function included in both the nanny and the jupyter_client code. I think there are at least two use cases here for such a function: e.g. on Windows the R kernel's process chain is `notebook-server - r.exe - cmd - rterm.exe` (probably longer when not directly calling the 64-bit version) -> the notebook kills `r.exe`, but `rterm.exe` is the actual R kernel. But IMO both cases are a nice feature of the kernel nanny (and the code in jupyter_client, which would have the same role in case there is no nanny): cleaning up after the user.
psutil can find all children of a process, though `killpg` seems like a better way to do it on Unix. Maybe we just need to depend on psutil on Windows.
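For the Unix side, the `killpg` approach might look like this minimal sketch (helper names are hypothetical; Unix-only, since `os.killpg` doesn't exist on Windows):

```python
import os
import signal
import subprocess
import sys


def launch_in_group(cmd):
    # start_new_session=True runs the child through setsid(), making it
    # the leader of a new process group we can later signal as a unit.
    return subprocess.Popen(cmd, start_new_session=True)


def kill_group(popen):
    # Signal the whole group, so grandchildren die along with the child.
    try:
        os.killpg(os.getpgid(popen.pid), signal.SIGKILL)
    except ProcessLookupError:
        pass  # group already gone
    popen.wait()
```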
Hi,
Adding another simple example of this issue:
Run the following code then restart your notebook. Then run it again and you'll see the children piling up (this assumes you have no other things sleeping for the demo to work).
```python
import getpass, os, psutil, signal, subprocess

def find_interesting_children():
    user = getpass.getuser()
    ret = []
    for pid in list(psutil.pids()):
        try:
            proc = psutil.Process(pid)
        except Exception:
            # Ignore now-dead processes.
            continue
        if proc.username() != user:
            continue
        if proc.name() != 'sleep':
            continue
        ret.append(pid)
    return ret

def preexec_function():
    signal.signal(signal.SIGINT, signal.SIG_IGN)

print(find_interesting_children())
child = subprocess.Popen(['sleep', '180d'], preexec_fn=preexec_function)
print(find_interesting_children())
```
If you want to clean up, run this later:

```python
for pid in find_interesting_children():
    os.kill(pid, signal.SIGTERM)
```
On a high level, this stems from the choice in this toy example to ignore SIGINT. Sometimes a subprocess may do this and it's out of your control so restarting kernels leaves behind a mess (if you don't ignore SIGINT things work as expected). Can we add logic to handle this variant of process cleanup?
Hi @mlucool, Thanks for the test case!
I've spent the better part of today looking into this and it's quite interesting (IMHO). There are three areas of interest, two of which are still a bit of a mystery to me.

1. `LocalProvisioner` has a regression issue in that it's issuing Popen `kill()` and `terminate()` commands directly, rather than preferring to issue the respective signals (`SIGKILL`, `SIGTERM`) against the process group, as was the case in 6.x. This issue, though, does not manifest itself in the current incantation of things (more below).
2. Ignoring `SIGINT` has an impact when ignoring any other signal (or using the default `SIGINT` handler) does not!
3. Shutdown (sans the `now` parameter) is "graceful", meaning that a `shutdown_request` message is sent and, unless there are issues, nothing else occurs (if the message is detected as not having worked [via polling], then SIGTERM/SIGKILL are sent). "Immediate" shutdown issues a `kill()` call directly. (The regression mentioned in item 1 is that this should use the signal against the process group before issuing `kill()`.)

What I don't understand is why handling `SIGINT` makes a difference here. Seems like it, coupled with the message-based shutdown, is an area that should be looked at (at the kernel level, I presume). (Does the kernel's message-based shutdown try to send `SIGINT` to its process group as its form of cleaning up child processes, by any chance? Just trying to explain what is seen.)
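The "graceful, then forceful" sequence described above can be sketched roughly as follows. This is an illustration of the escalation logic only, not jupyter_client's actual implementation; `request_shutdown` stands in for sending the `shutdown_request` message, and the signal escalation assumes Unix semantics:

```python
import signal
import subprocess
import time


def shutdown_with_escalation(popen, request_shutdown, poll_for=2.0):
    # Step 1: graceful - ask the kernel to shut itself down.
    request_shutdown()
    deadline = time.monotonic() + poll_for
    while time.monotonic() < deadline:
        if popen.poll() is not None:
            return  # kernel exited gracefully
        time.sleep(0.1)
    # Step 2: the message didn't work, escalate to SIGTERM.
    popen.send_signal(signal.SIGTERM)
    try:
        popen.wait(timeout=2)
    except subprocess.TimeoutExpired:
        # Step 3: last resort, SIGKILL.
        popen.kill()
        popen.wait()
```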
One thing we could do relative to restarts is perform shutdown in an immediate fashion (`now=True`). This will then trigger the signal/process-group form of shutdown (provided the `LocalProvisioner` is fixed).
This would still leave an issue when just terminating a kernel that has set up its child processes to ignore `SIGINT`, since that uses "graceful" termination.
For now, I'd like to fix the regression issue in `LocalProvisioner`. Of course, this fix won't be "visible" until immediate shutdowns are more prevalent.
Should we make the shutdown issued from `restart_kernel` use `now=True` and bypass the message-based shutdown on restarts? Another approach might be to update the handler in jupyter_server to use `now=True` when restarting the kernel.
It would be great to hear from someone with kernel-side experience on how "shutdown_request" with ignored `SIGINT` comes into play.
CC @Carreau
I think what's happening is as follows (I am not familiar with the client or kernel code, so this is a guess):
Proposal (maybe this should move to ipykernel/IPython?):
On SIGINT the kernel does the normal SIGINT things but does not exit until all its children exit first.
The `SIGINT` issued during `shutdown_kernel()` is merely an attempt to have any current processing interrupted prior to the process's termination (via "shutdown_request" and, possibly, `SIGTERM`/`SIGKILL`). It was (relatively) recently added and is literally the `interrupt_kernel()` method itself. It is not meant to terminate anything.
I think the parent process should be responsible for the termination of its immediate children (and so on). We have the ability to cheat when the kernel is a local process and send a signal to the process group but (per my previous comment) that only occurs in "fallback" situations when the message-based shutdown request has proven ineffective. (Note: this was one of the primary reasons the interrupt was introduced in the shutdown sequence, so the kernel could respond to "shutdown_request" messages.)
I would agree that when receiving a "shutdown_request" message, the kernel process should do whatever it can to terminate its children. This could probably be something like `os.killpg(os.getpgrp(), SIGTERM)` (and the equivalent Windows code) as its last act.
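A self-contained demo of that "last act" idea (Unix only; this is an illustration, not ipykernel's code). The stand-in "kernel" below spawns a child, then ignores `SIGTERM` in itself and signals its own process group, taking the child down with it. The demo runs the "kernel" in a fresh session so the group-wide signal can't escape to the caller:

```python
import subprocess
import sys
import textwrap

# Script played by the stand-in "kernel" process.
KERNEL_CODE = textwrap.dedent("""
    import os, signal, subprocess, sys
    # The "kernel" spawns a long-running child...
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    # ...then, as its last act on shutdown, ignores SIGTERM itself and
    # signals the whole process group, so the child dies with it.
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    os.killpg(os.getpgrp(), signal.SIGTERM)
    child.wait(timeout=5)
    print("child terminated:", child.returncode is not None)
""")


def run_demo():
    # start_new_session=True gives the "kernel" its own process group,
    # so the killpg above only reaches the kernel and its children.
    result = subprocess.run(
        [sys.executable, "-c", KERNEL_CODE],
        start_new_session=True, capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(run_demo())
```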
FTR, ipykernel 6.9.2 fixes https://github.com/jupyter/jupyter_client/issues/104#issuecomment-1028388087.
Is there a way to make jupyter NOT stop sub-processes?
Hi @alita-moore. Life-cycle management of the kernel process (i.e., the sub-process) is the responsibility of the kernel provisioner. This gives one the ability to implement whatever kind of process management they'd like, beyond what is provided by the local provisioner by default.
Restarting a kernel only kills the direct child, but not any opened subprocesses of that child. This is a problem if the kernel spawns multiple child processes, like the R kernel (on Windows: `R - cmd - R - cmd - Rterm`), or if the Python kernel opens a subprocess which sleeps. Issue on the IRkernel repo: https://github.com/IRkernel/IRkernel/issues/226
Reproducible example (at least on Windows): put it into a cell, execute it and then restart the kernel via the notebook UI.