Closed · pmdarrow closed this 8 years ago
I'm :+1: on this
I'm also willing to contribute for this if I'm pointed in the right direction.
@presidento Just looking at the code, I would say that the place where handling of the `KeyboardInterrupt` exception should be added is here.
@frol I don't think so. IMHO, the best way to handle Ctrl+C would be sending the SIGKILL signal to the external process and waiting gracefully until it stops. Right now the behaviour is very different (#315): Invoke exits, while the started process remains running in the background.
Also, I don't want Invoke to handle the Ctrl+C itself, because we frequently use `invoke run bash`. And I run other commands within that bash shell, so it would be good to return to it after pressing Ctrl+C.
@presidento I see your point. Then it seems it should also be patched here. Catch the `KeyboardInterrupt` exception, send `SIGTERM` to the child (why would you want to kill it with `SIGKILL`?), wait again, and once it is terminated, I would raise `KeyboardInterrupt` to exit, since that is what I would expect as a user instead of continuing to run. (Also, you may consider handling subsequent Ctrl+C hits, which would kill a misbehaving child with the `SIGKILL` signal.)
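That flow - catch the interrupt around the wait, `SIGTERM` the child, escalate to `SIGKILL` on repeat hits, then re-raise - sketches out roughly like this (hypothetical names and structure, not Invoke's actual code):

```python
import signal
import subprocess
import sys

def run_with_ctrl_c(args):
    # Hypothetical sketch of the proposed behavior, not Invoke's implementation.
    proc = subprocess.Popen(args)
    interrupts = 0
    while True:
        try:
            code = proc.wait()
            break
        except KeyboardInterrupt:
            interrupts += 1
            # First Ctrl+C asks nicely; any further hit on a misbehaving
            # child escalates to SIGKILL.
            sig = signal.SIGTERM if interrupts == 1 else signal.SIGKILL
            if proc.poll() is None:  # child may already be gone
                proc.send_signal(sig)
    if interrupts:
        # Exit rather than carry on running further tasks.
        raise KeyboardInterrupt
    return code
```

A second Ctrl+C while we're back inside `wait()` lands in the same `except` block and escalates.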
My off-the-cuff feedback:

- The `except`, or somewhere soon afterwards, is probably where we ought to signal the subprocess.
- Probably wants a generic `Runner` method + an explicit `Local` method, because the act of signaling differs between local and remote execution.
- `SIGTERM` is probably the best default to go with, though making this configurable might be a good idea...

Poking at this myself now, referencing #331 as a jumping-off point (thanks @presidento!). Notes as I go:
- Wrapped the try/except around the call to `wait` inside `run` instead of in `wait` itself, as it's implementation-agnostic & also just felt cleaner.
- Added `signal_interrupt()` as the implementation-specific subroutine that handles signaling.
- Using `killpg` instead of `kill`, since that does feel like the most correct call (more likely to correctly handle situations where the subprocess spawns its own subprocesses).
- Still to do: `pty=False` (which also includes some Windows considerations).

EDIT: getting permission-denied errors on `killpg`; digging into that next.
Seems like the `killpg` thing (I get `OSError: permission denied`) is a Darwin/BSD problem (e.g. this); but it works OK on Linux.
That said, regular `kill` seems to work OK on both platforms, and while submitting the signal to the whole process group sounds nice, I'm not sure whether it's a big enough benefit to be worth the apparent cross-platform issues. E.g. I'd expect most subprocesses that spawn their own subprocesses to handle a single SIGINT sent to just them, gracefully enough.
So I'm probably going to go with `kill` for now, tho as usual I'm open to arguments. (Just, the argument needs to overcome the platform issue or the complexity cost of working around it :))
Using `killpg` was not a conscious choice; I just googled for the existing problem and found a "works for me" solution (for Linux). So I don't have arguments against `kill`.
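For reference, the difference under discussion, as a POSIX-only sketch (the sleeping child and the `start_new_session` usage here are illustrative):

```python
import os
import signal
import subprocess
import sys

SLEEPER = [sys.executable, "-c", "import time; time.sleep(30)"]

# kill: signals exactly one process. Portable, but any grandchildren the
# child spawned are left untouched.
child = subprocess.Popen(SLEEPER)
os.kill(child.pid, signal.SIGTERM)
child.wait()

# killpg: signals the child's whole process group. Here the child is made
# a group leader first (start_new_session=True); trying to signal a group
# you belong to, or one you lack rights over, is where the permission
# errors noted above can come from on Darwin/BSD.
leader = subprocess.Popen(SLEEPER, start_new_session=True)
os.killpg(os.getpgid(leader.pid), signal.SIGTERM)
leader.wait()
```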
Great, glad to hear it! :)
Back on this. Works reasonably well in practice; been poking at a non-crap integration test for it, which grew into a generic "assert signal passthrough" setup (though that is only on the test-harness side; Invoke itself still only handles SIGINT/KeyboardInterrupt right now).
Actually invoking the test involves starting a subprocess running Invoke, then sending it SIGINT while it's running. Ideally this wants a truly asynchronous API for `run`, but I don't have time to enter that rabbit hole right now. Basic use of threading works pretty well in its stead.
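The thread-based stand-in can look something like this (the sleeping child below is just a placeholder for "a subprocess running Invoke"):

```python
import os
import signal
import subprocess
import sys
import threading
import time

proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])

def interrupt_soon():
    time.sleep(1)  # crude wait for the child interpreter to finish starting
    os.kill(proc.pid, signal.SIGINT)

threading.Thread(target=interrupt_soon).start()
proc.wait()  # returns long before the 30s sleep would have finished
# The child died from the SIGINT rather than exiting cleanly.
assert proc.returncode != 0
```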
Need to tidy up the WIP (e.g. merge my tiny threading handler with `runners._IOThread`) and then finish up with the non-pty implementation & test cases (the base case just looks at `pty=True`).
I'm apparently a moron and this may not have worked well after all. Suspect the problem was that I worked on the test harness stuff mostly on its own, and goofed hooking it up to the real behavior.
(This was complicated by the fact that the sub-subprocess expecting a given signal, has no actual way of communicating cleanly with the test runner, besides parsing stdout/stderr. So I was actually missing some "failures" at some point.)
What seems to be happening so far: Invoke hangs after the Ctrl-C and never gets around to actually signaling the subprocess (via `send_interrupt`) until too late. This is not happening in the unit tests, but that's because they're unit tests & have mocking designed to allow things to flow fast-but-clean (and the "submit signal downstream" code is not actually broken - `os.kill` does get called eventually - it just isn't happening until after the subprocess exits naturally).

I would consider skipping past this because ugh, time/complex, but the fact that unit tests can't see this problem exposes the need for solid integration tests.
OK, as I suspected earlier but then forgot while writing the above: it's because the stdin handler is the only one honoring `program_finished` (and this is because it needs that flag to know when to shut down, as Invoke's own stdin is typically an interactive terminal pipe with no clearly defined end).
Presently, the stdout/err handlers prefer to consume "the whole stream" to avoid race conditions, so they ignore that Event object. But that's now actually preventing us from sending the signal that would cause those streams to terminate, in a nice Catch-22.
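The stdin-handler pattern being described - pump the stream, but honor a shared `Event` since interactive stdin has no EOF - boils down to something like this (fakes stand in for the real streams; this is illustrative, not Invoke's actual `_IOThread`):

```python
import threading
import time

def stdin_loop(read_chunk, write, program_finished):
    # Keep pumping until explicitly told the program is done; unlike the
    # stdout/err handlers, we can't rely on hitting end-of-stream.
    while not program_finished.is_set():
        data = read_chunk()
        if data:
            write(data)

# Demo with fake stream functions:
chunks = ["a", "b"]
out = []
finished = threading.Event()

def fake_read():
    time.sleep(0.01)
    return chunks.pop(0) if chunks else None

t = threading.Thread(target=stdin_loop, args=(fake_read, out.append, finished))
t.start()
time.sleep(0.2)
finished.set()
t.join()
print(out)  # ['a', 'b']
```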
So I see two ways out of this:

- Call `send_interrupt` at KeyboardInterrupt time instead of post-thread-shutdown time. This feels like the right thing to do offhand.
- Make the stdout/err handlers honor the event too, which likely means pushing more state onto the `Runner` instance - something I'd prefer not to do if I can get away with it.

Furthermore, while I'm not 100% sure about the why, the problem only happens for PTYs; in the no-PTY case, the signal seems to be getting submitted downstream automatically & immediately - when I test this on the integration suite's helper module, it sees SIGINT before we even get around to calling `send_interrupt`. (This also means we probably don't actually need to do anything there in this case, perhaps... perhaps on Windows, though.)
Offhand I'm thinking this may be due to no-pty using straight-up `subprocess.Popen`, meaning the subprocess is a direct child of the Invoke process, and my testing right now is straight-up Ctrl-C in a terminal, which IIRC acts like `killpg` instead of `kill`.
EDIT: yea, if I change to using `pkill -INT` from another terminal, the same hanging behavior occurs, even with `pty=False`. And google confirms - Ctrl-C being interpreted as "SIGINT to foreground process group" is POSIX behavior. Finally, I doublechecked using (Darwin/BSD) `ps -ax -O tpgid` (control terminal process group ID column displayed) and yes: with no-pty, both invoke and the subprocess share a process group, but with pty, the pty-driven subprocess gets its own process group.
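That process-group split is easy to observe from Python (POSIX-only; `start_new_session=True` below plays the role of `pty.fork` giving the child its own group):

```python
import os
import subprocess
import sys

SLEEPER = [sys.executable, "-c", "import time; time.sleep(10)"]

# Plain Popen: the child inherits our process group, so a terminal Ctrl-C
# (SIGINT to the foreground group) reaches it without our help.
child = subprocess.Popen(SLEEPER)
child_shares_group = os.getpgid(child.pid) == os.getpgid(0)

# Own session/group: terminal-generated SIGINT will NOT reach it, so it
# must be signaled explicitly -- the pty situation described above.
loner = subprocess.Popen(SLEEPER, start_new_session=True)
loner_shares_group = os.getpgid(loner.pid) == os.getpgid(0)

print(child_shares_group, loner_shares_group)  # True False

for p in (child, loner):
    p.terminate()
    p.wait()
```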
- Moving `send_interrupt`'s call to time-of-`except` works much better, re: by-hand testing & the integration test.
- Though `signaling.py` nicely exits w/ exceptions when unexpected things happen, it's being run via the middle Invoke, which is being SIGINT'd, and will thus always exit 130.
- That outer `run` cycle will never complete due to interruption, and thus the fact that the inner process "failed" cannot enter the picture.

Updating the innermost script to be a good Unix citizen and only output on failure works well enough; then, if I revert the implementation to trigger the error case, things fail nicely.
Next is the no-pty use case; as stated, it may not be strictly necessary, but we still need to capture SIGINT and submit it in case it is not generated by an interactive Ctrl-C. So far, interactive tests show that it doesn't hurt anything to try sending it "twice" (though I do wonder if there are cases where `subprocess.Popen.send_signal` will explode if the subprocess has already exited by the time it's called...).
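A defensive shape for that concern (illustrative; recent CPythons document `Popen.send_signal` as doing nothing for a child known to have exited, but the exit-vs-signal race is inherent):

```python
import signal
import subprocess
import sys

proc = subprocess.Popen([sys.executable, "-c", "pass"])
proc.wait()  # child is definitely gone by now

# Only signal a live child, and tolerate the window where it dies
# between the poll() and the actual kill.
if proc.poll() is None:
    try:
        proc.send_signal(signal.SIGINT)
    except ProcessLookupError:
        pass  # already exited/reaped; nothing to do
```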
Beyond that, I'm not planning to tackle the "other signals" problem yet: most of them either can't usefully be caught via `signal` or will result in correct-enough cascading process death on most OSes. So those can definitely wait for a solid bug report / PR.
Seems to also be some issue on Travis (because of course there is!) - https://travis-ci.org/pyinvoke/invoke/jobs/117052576
Will try loading up my copy of their docker image sometime and see wtf.
I can recreate this on regular ol Linux, no need for the docker image at this time.
The `pkill` used to submit the `SIGINT` is what's dying (with `-2`, and I can't figure out what that's supposed to mean; neither the pkill manpage nor google helps much). When I poke by hand, some really weird shit is going on:
- `pgrep` on this system (Debian 8) is buggy and is finding its own PID...
- Running `pgrep` by hand in another shell only shows 2 PIDs: the invoke proc and its spawned `sh`.
- That differs from what `pgrep`/`pkill` find on Darwin - those only ever turn(ed) up the Invoke process. Guessing GNU vs BSD semantics, e.g. maybe the BSD pgrep/pkill only look at process group leaders or something?
- If I `kill -INT` it, things work correctly, so at least the core concept for the integration test still flies.

Dug deeper:
- Linux `ps` wants `-al` to show the full command line (BSD just needs `-l`); once I found that, I confirmed the 3 PIDs are as expected: invoke, sh, and pgrep itself.
- So Linux is spawning intermediate `sh` processes where Darwin is not. While that's perplexing & I really want to/should understand it, it'll have to wait; as stated previously, this has been a deep rabbit hole already.
- Tightening the pattern to `python.*bin/inv -c signal_tasks` (to select the actual Python process instead of the `sh` 'wrapper') works equally well on both platforms, by hand.
- The earlier "buggy pgrep" weirdness is presumably just it matching that intermediate `sh` process on Linux, which isn't actually pgrep/pkill itself, and so it's selecting it.
- I can dodge the intermediate `sh` process by using `pty=True` (since this uses `fork` instead of Python's `subprocess` module) and this seems like a good-enough workaround for now.

Have the `pty=False` integration test working now (plus reworked the IO thread stuff so it's more generic and lives in `util` :smile_cat:). Naturally, it too is broken on Travis and thus probably Linux, while working fine on Darwin (EDIT: and Windows/AppVeyor - phew!). Poking...
- `pkill` is still "working", insofar as Invoke sees the signal and gets around to calling `subprocess.Popen.send_signal`.
- But `send_signal` is clearly not having the desired effect in this case. Figuring out why.

Yea, as before, the issue here is that Darwin's `subprocess` is not generating intermediate `sh` processes, but is instead directly spawning the inner Python. (Sending SIGINT to either has the same result, which is good/expected.)

On Linux, the intermediate `sh` is there, and sending SIGINT to it doesn't percolate into the innermost Python process. (However, sending SIGINT directly to that inner process does satisfy its signal handlers - tho it'd be even stranger if it didn't.)
Unlike with the previous problem, I can't "work around" this, because it means non-pty (which is the default too!) behavior won't pass-through SIGINT cleanly. So I do have to figure this out after all =/
- `subprocess.Popen(shell=True)` uses `/bin/sh`, hardcoded.
- On Darwin, `/bin/sh` is GNU bash 3.
- On Debian 8, `/bin/sh` is a symlink to `dash` (`0.5.7`, for whatever that's worth).
- `run(pty=True)` uses `/bin/bash` for the time being; I wonder if changing it to `/bin/sh` would yield similar issues on Linux for that use case too; something to check.
- Repointing `/bin/sh` to link to `bash` instead of `dash` might be a fun experiment (presuming it doesn't make my VM explode, but I doubt it would unless I tried poking init scripts/etc).
- My guess: `bash` on Darwin is doing some sort of `exec` of its arguments (replacing the shell process with the command executed), and `dash` on Debian isn't.
- Notably, there are no `/bin/sh` processes (or even any related `bash` processes) on my Darwin system while Invoke + a sleeping `signaling.py` are running; so it's not like my `pstree` is hiding things (presumably).
- Part of me wants to ditch `subprocess` at this point, given we handle so much of the work ourselves anyways.
- The pty path is already built around `os.fork`, so if it comes down to us just toggling use of `os.fork` vs `pty.fork` (& other odds and ends truly pty-specific)... maybe?
- On Darwin, our `send_signal` is spurious because the shell/terminal signals the entire process group: this results in `signaling.py` emitting two "success" messages (when tweaked to do so), versus only one on Linux.
- Our use of `subprocess` is the same on both systems; there's no platform-specific stuff besides windows-vs-posix. Both end up with `args: ['/bin/sh', '-c', 'python signaling.py SIGINT'], executable: /bin/sh` in my debuggery.
in my debuggery./bin/sh -c "python signaling.py SIGINT"
in my shell (note: it's zsh on both, which shouldn't matter here), I get the same pstree structures as when subprocess
does the same: Darwin is just zsh -> python
, Debian is zsh -> sh -> python
./bin/bash
in place of /bin/sh
on Debian, does what I expected too: it does exactly what bash
on Darwin does: I'm left with zsh -> python
. So this behavior is specific to the system-local /bin/sh
. (So much for it being portable!)/bin/sh
in the non-pty use case, for maximum portability/compatibility" would probably just make both use cases break in this manner, in this situation.[EXPLETIVE DELETED]
.local()
on the same systems...shouldn't it have the exact same problem? I don't recall reports of that but they may exist. Something worth checking.Then again, I just realized, that the Ctrl-C use case isn't broken, it's only regular submission of SIGINT
that is broken here. The kill-foreground-process-group behavior of a real shell's Ctrl-C suffices here. (Too bad there's no apparent way to force a process-group variant of send_signal
...)
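A quick probe for reproducing the above on any given box (POSIX-only sketch; the explicit `exec` just demonstrates the wrapper-removal behavior and is not what Invoke does):

```python
import os
import signal
import subprocess

# Which shell /bin/sh really is varies by system (dash on Debian, bash 3
# on Darwin), and that decides whether an intermediate sh sits between
# you and your command.
print(os.path.realpath("/bin/sh"))

# An explicit exec removes the wrapper under any shell: proc.pid is then
# the command itself, so signals sent to it land where you expect.
proc = subprocess.Popen("exec sleep 10", shell=True)
proc.send_signal(signal.SIGTERM)
proc.wait()
print(proc.returncode == -signal.SIGTERM)  # True
```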
So this probably can just wait until it bites somebody; submitting a manual `SIGINT` instead of using Ctrl-C seems like it'd be quite rare. I'm very, very sick of tickets that sound simple and end up taking an inordinate amount of time to correctly handle even the base 80% of real use cases :(
For the record, when I did check Fabric 1's behavior, it does seem to work better; obviously it still results in an intermediate `sh` process, but `kill -INT <fabric PID>` causes things to shut down correctly.
The primary differences between the ways we use `subprocess` are:

- Fabric 1 leaves the stdio streams as `None` (meaning inheritance), and we of course set them to `PIPE` so we can intercept them.
- Fabric 1 sets `close_fds=True` when not-Windows; we leave it as the default, which is `False`.

Tried it out: `close_fds=True` in Invoke makes zero difference re: this particular behavior.

Merged to master \o/
And Travis reminds me that the whole reason I was poking this last bit is that it is breaking the integration test on Linux. Got so wrapped up in figuring out WTF that I forgot it needs addressing in some fashion.
I may just comment out that last integration test for now...will sleep on it.
Glad I slept on it; realized this morning a decent workaround is to expose a config option for the shell executable and use it in both pty and non-pty use cases:

- Defaults stay as they are now: pty (`/bin/bash`) and non-pty (`/bin/sh`).
- Users who hit this (e.g. their `/bin/sh` is dash and they're sending raw SIGINT to the dash-based `sh`...) can tweak it.
- The integration tests can then set it to `/bin/bash`.

Going to make that a new ticket (EDIT: reused old one, #67) since from most perspectives it's not technically related to this one.
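The shape of that workaround, roughly (the function and kwarg below are illustrative, not the exact API #67 ended up with):

```python
import subprocess

def run(command, shell="/bin/bash"):
    # One overridable shell setting used for every run, defaulting to
    # bash, instead of leaning on subprocess's hardcoded /bin/sh (which
    # may be dash and behave differently re: signal passthrough).
    return subprocess.run([shell, "-c", command]).returncode
```

`run("exit 3")` returns 3; anyone bitten by the dash behavior just points `shell` somewhere else.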
Implemented #67 and the tests pass, including the one that was breaking for this issue (because all shells are now /bin/bash by default). Toot toot.
Hi @bitprophet, thank you for the detailed log! I think the SIGINT handling has come far enough now.
(Didn't you want to add it to the 0.12.3 milestone?)
Hi!
I just tested: it seems to work well only when `pty=False`.
With `pty=True`, signals (tested with `SIGINT`) are not relayed to the child process.
The integration tests for this are, at present, failing intermittently. Not sure if some other change in the interim is causing or what.
They're already "meh" because there's a `sleep` in there to prevent the signal from happening before the inner Python interpreter has fully started up; unclear whether that is the reason things fail or not (I bumped it from 1s to 2s briefly and it made no obvious difference).

Happening mostly/entirely under Python 3, probably because it's slower. (I earlier noted it was on both, but that was my mistake: I erroneously bumped the `sleep` but not the `alarm`. Comment added to prevent that in future.)
@noirbizarre In real-world testing (unrelated to my note above about the integration tests) it still works fine for me with `pty=True`. Can you provide more details sometime?
Simply bumping the `alarm` to 2s from 1s makes this appear to go away; guessing the issue was that in my environment, under Python 3, I was hitting a race condition.
Re-closing, @noirbizarre do lmk if you can reliably recreate your problem - all details appreciated.
Interrupting a running task with Ctrl+C results in this traceback:

I'm migrating from Fabric, which handles SIGINT nicely by exiting with `sys.exit(1)` and printing `Stopped.`. It also properly forwards the interrupt to any processes started with `fabric.api.local()`. How can I get this behaviour in Invoke?

Willing to contribute a fix for this if I'm pointed in the right direction!