pyinvoke / invoke

Pythonic task management & command execution.
http://pyinvoke.org
BSD 2-Clause "Simplified" License
4.38k stars 367 forks source link

Interrupting a running task with Ctrl+C #152

Closed pmdarrow closed 8 years ago

pmdarrow commented 10 years ago

Interrupting a running task with Ctrl+C results in this traceback:

^CTraceback (most recent call last):
  File "/Users/peter/Envs/are/bin/inv", line 9, in <module>
    load_entry_point('invoke==0.8.2', 'console_scripts', 'inv')()
  File "/Users/peter/Envs/are/lib/python2.7/site-packages/invoke/cli.py", line 295, in main
    dispatch(sys.argv)
  File "/Users/peter/Envs/are/lib/python2.7/site-packages/invoke/cli.py", line 288, in dispatch
    return executor.execute(*tasks, dedupe=dedupe)
  File "/Users/peter/Envs/are/lib/python2.7/site-packages/invoke/executor.py", line 89, in execute
    task=task, name=name, args=args, kwargs=kwargs
  File "/Users/peter/Envs/are/lib/python2.7/site-packages/invoke/executor.py", line 128, in _execute
    return task(*args, **kwargs)
  File "/Users/peter/Envs/are/lib/python2.7/site-packages/invoke/tasks.py", line 108, in __call__
    result = self.body(*args, **kwargs)
  File "/Users/peter/Code/are-platform/API/tasks.py", line 56, in start
    run_env('python are/servers/dev_server.py', shell_env, verbose)
  File "/Users/peter/Code/are-platform/API/tasks.py", line 27, in run_env
    run('%s %s' % (env_vars, cmd))
  File "/Users/peter/Envs/are/lib/python2.7/site-packages/invoke/runner.py", line 143, in run
    stdout, stderr, exited, exception = func(command, warn, hide)
  File "/Users/peter/Envs/are/lib/python2.7/site-packages/invoke/runner.py", line 65, in run
    stdout, stderr = process.communicate()
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 798, in communicate
    return self._communicate(input)
  File "/Users/peter/Envs/are/lib/python2.7/site-packages/invoke/monkey.py", line 58, in _communicate
    rlist, wlist, xlist = select.select(read_set, write_set, [])
KeyboardInterrupt

I'm migrating from Fabric which handles SIGINT nicely by exiting with sys.exit(1) and printing Stopped.. It also properly forwards the interrupt to any processes started with fabric.api.local(). How can I get this behaviour in Invoke?

Willing to contribute a fix for this if I'm pointed in the right direction!

sophacles commented 10 years ago

I'm :+1: on this

presidento commented 8 years ago

I'm also willing to contribute for this if I'm pointed in the right direction.

frol commented 8 years ago

@presidento Just looking at the code, I would say that the place where handling of KeyboardInterrupt exception should be added is here.

presidento commented 8 years ago

@frol I don't think so. IMHO, the best way to handling Ctrt+C would be sending the SIGKILL signal to the external process and wait gracefully until it stops. Now the behaviour is very different (#315): Invoke exits, while the started process remain running in the background.

Also I don't want to Invoke handle the Ctrl+C, because we frequently use invoke run bash. And I run other commands within the bash shell, so it would be good to return into it after pressing Ctrl+C.

frol commented 8 years ago

@presidento I see your point. Then it seems it should be also patched here. Catch KeyboardInterrupt exception, send SIGTERM to a child (why would you want to kill it with SIGKILL?), wait again and once it is terminated, I would raise KeyboardInterrupt to exit since this is what I would expect as a user instead of continuing running. (Also, you may consider handling subsequent CTRL+C hits, which will kill a misbehaved child with SIGKILL signal.)

bitprophet commented 8 years ago

My off the cuff feedback:

bitprophet commented 8 years ago

Poking at this myself now, referencing #331 as a jumping-off point (thanks @presidento!). Notes as I go:

EDIT: getting permission denied errors on killpg; digging into that next.

bitprophet commented 8 years ago

Seems like the killpg thing (I get OSError: permission denied) is a Darwin/BSD problem (e.g. this); but it works OK on Linux.

That said, regular kill seems to work OK on both platforms, and while submitting the signal to the whole process group sounds nice, I'm not sure whether it's a big enough benefit to be worth the apparent cross-platform issues. E.g. I'd expect most subprocesses that spawn their own subprocesses to handle a single SIGINT sent to just-them, gracefully enough.

So I'm probably going to go with kill for now, tho as usual I'm open to arguments. (Just, the argument needs to overcome the platform issue or the complexity cost of working around it :))

presidento commented 8 years ago

Using killpg was not conscious, I just googled for an existing problem and have found a "works for me" solution (for Linux). So I don't have arguments againts kill.

bitprophet commented 8 years ago

Great, glad to hear it! :)

bitprophet commented 8 years ago

Back on this. Works reasonably well in practice; been poking at a non crap integration test for it, which grew into a generic "assert signal passthrough" setup (though that is only on the test harness side, Invoke itself still only handles SIGINT/KeyboardInterrupt right now).

Actually invoking the test involves starting a subprocess running Invoke, then sending it SIGINT while it's running. Ideally, this wants a truly asynchronous API for run, but I don't have time to enter that rabbit hole right now. Basic use of threading works pretty well in its stead.

Need to tidy up the WIP (eg merge my tiny threading handler with runners._IOThread) and then finish up with the non-pty implementation & test cases (the base case just looks at pty=True).

bitprophet commented 8 years ago

I'm apparently a moron and this may not have worked well after all. Suspect the problem was that I worked on the test harness stuff mostly on its own, and goofed hooking it up to the real behavior.

(This was complicated by the fact that the sub-subprocess expecting a given signal, has no actual way of communicating cleanly with the test runner, besides parsing stdout/stderr. So I was actually missing some "failures" at some point.)

What seems to be happening so far:

This is not happening in the unit tests, but that's because they're unit tests & have mocking designed to allow things to flow fast-but-clean (and the "submit signal downstream" code is not actually broken - os.kill does get called eventually - it just isn't happening until after the subprocess exits naturally).

I would consider skipping past this because ugh, time/complex, but the fact that unit tests can't see this problem, exposes the need for solid integration tests.

bitprophet commented 8 years ago

OK, as I suspected earlier but then forgot while writing the above: it's because the stdin handler is the only one honoring program_finished (and this is because it needs that flag to know when to shut down, as Invoke's own stdin is typically an interactive terminal pipe with no clearly defined end).

Presently, the stdout/err handlers prefer to consume "the whole stream" to avoid race conditions, so they ignore that Event object. But that's now actually preventing us from sending the signal that would cause those streams to terminate, in a nice Catch-22.

So I see two ways out of this:

bitprophet commented 8 years ago

Furthermore, while I'm not 100% sure about the why, the problem only happens for PTYs; in the no-PTY case, the signal seems to be getting submitted downstream automatically & immediately - when I test this on the integration's helper module, it sees SIGINT before we even get around to calling send_interrupt. (This also means we probably don't actually need to do anything there in this case, perhaps...perhaps on Windows though).

Offhand I'm thinking this may be due to no-pty using straight up subprocess.Popen, meaning the subprocess is a direct child of the Invoke process, and my testing right now is straight up Ctrl-C in a terminal, which IIRC acts like pgkill instead of pkill.

EDIT: yea, if I change to using pkill -INT from another terminal, the same hanging behavior occurs, even with pty=False. And google confirms - Ctrl-C being interpreted as "SIGINT to foreground process group" is POSIX behavior. Finally, I doublechecked using (Darwin/BSD) ps -ax -O tpgid (control terminal process group ID column displayed) and yes, with no-pty, both invoke and the subprocess share a process group, but with pty, the pty-driven subprocess gets its own process group.

bitprophet commented 8 years ago
bitprophet commented 8 years ago

Updating the innermost script to be good-Unix-citizen and only output on failure, works well enough; then if I revert the implementation to trigger the error case, things fail nicely.

Next is the no-pty use case; as stated it may not be strictly necessary, but we still need to capture SIGINT and submit it in case it is not generated by an interactive Ctrl-C. So far, interactive tests show that it doesn't hurt anything to try sending it "twice" (though I do wonder if there are cases where subprocess.Popen.send_signal will explode if the subprocess has already exited by the time it's called...).

Beyond that, I'm not planning to tackle the "other signals" problem yet:

So those can definitely wait for a solid bug report / PR.

bitprophet commented 8 years ago

Seems to also be some issue on Travis (because of course there is!) - https://travis-ci.org/pyinvoke/invoke/jobs/117052576

Will try loading up my copy of their docker image sometime and see wtf.

bitprophet commented 8 years ago

I can recreate this on regular ol Linux, no need for the docker image at this time.

The pkill used to submit the SIGINT is what's dying (with -2, and I can't figure out what that's supposed to mean, neither pkill manpage nor google helps much). When I poke by hand some really weird shit is going on:

bitprophet commented 8 years ago

Dug deeper:

bitprophet commented 8 years ago

Have pty=False integration test working now (plus reworked the IO thread stuff so it's more generic and lives in util :smile_cat:). Naturally, it too is broken on Travis and thus probably Linux, while working fine on Darwin (EDIT: and Windows/appveyor - phew!). poking...

bitprophet commented 8 years ago

Yea, as before, the issue here is Darwin subprocess is not generating intermediate sh processes, but is instead directly spawning the inner Python. (Sending SIGINT to either has the same result, which is good/expected.)

On Linux, the intermediate sh is there, and sending SIGINT to it doesn't percolate into the innermost Python process. (However, sending SIGINT directly to that inner process does satisfy its signal handlers - tho it'd be even stranger if it didn't.)

Unlike with the previous problem, I can't "work around" this, because it means non-pty (which is the default too!) behavior won't pass-through SIGINT cleanly. So I do have to figure this out after all =/

bitprophet commented 8 years ago
bitprophet commented 8 years ago
bitprophet commented 8 years ago

Then again, I just realized, that the Ctrl-C use case isn't broken, it's only regular submission of SIGINT that is broken here. The kill-foreground-process-group behavior of a real shell's Ctrl-C suffices here. (Too bad there's no apparent way to force a process-group variant of send_signal...)

So this probably can just wait until it bites somebody; submitting a manual SIGINT instead of using Ctrl-C seems like it'd be quite rare. I'm very very sick of tickets that sound simple and end up taking an inordinate amount of time to correctly handle even the base 80% of real use cases :(


For the record, when I did check Fabric 1's behavior, it does seem to work better; obviously, it still results in an intermediate sh process, but kill -INT <fabric PID> causes things to shut down correctly.

The primary differences between the ways we use subprocess are:

bitprophet commented 8 years ago

Merged to master \o/

bitprophet commented 8 years ago

And Travis reminds me that the whole reason I was poking this last bit, is it is breaking the integration test on Linux. Got so wrapped up in figuring out WTF I forgot that it needs addressing in some fashion.

I may just comment out that last integration test for now...will sleep on it.

bitprophet commented 8 years ago

Glad I slept on it, realized this morning a decent workaround is to expose a config option for the executable and use it in both pty and non-pty use cases:

Going to make that a new ticket (EDIT: reused old one, #67) since from most perspectives it's not technically related to this one.

bitprophet commented 8 years ago

Implemented #67 and the tests pass, including the one that was breaking for this issue (because all shells are now /bin/bash by default). Toot toot.

presidento commented 8 years ago

Hi @bitprophet, thank you for the detailed log! I think, handling SIGINT correctly is far enough now.

presidento commented 8 years ago

(Didn't you want to add it to the 0.12.3 milestone?)

noirbizarre commented 8 years ago

Hi !

I just tested: it seams to works well only when pty=False. When having pty=True, signals (tested with SIGINT) is not relayed to the child process.

bitprophet commented 8 years ago

The integration tests for this are, at present, failing intermittently. Not sure if some other change in the interim is causing or what.

They're already "meh" because there's a sleep in there to prevent the signal from happening before the inner Python interpreter has fully started up; unclear if that is the reason things fail or not (I bumped it from 1s to 2s briefly and made no obvious difference).

Happening mostly/entirely under Python 3, probably because it's slower. (I earlier noted it was on both but that was my mistake, I erroneously bumped the sleep but not the alarm. Comment added to prevent that in future.)

bitprophet commented 8 years ago

@noirbizarre In real world testing (unrelated to my note above about the integration tests) it still works fine for me with pty=True. Can you provide more details sometime?

bitprophet commented 8 years ago

Simply bumping the alarm to 2s from 1s makes this appear to go away; guessing the issue was that on my environment & under Python 3, I was hitting a race condition.

bitprophet commented 8 years ago

Re-closing, @noirbizarre do lmk if you can reliably recreate your problem - all details appreciated.