pyinvoke / invoke

Pythonic task management & command execution.
http://pyinvoke.org
BSD 2-Clause "Simplified" License
4.38k stars 367 forks source link

Backgrounded Invoke processes halt execution on run() #439

Closed bitprophet closed 7 years ago

bitprophet commented 7 years ago

See bottom half of this comment as well as this later comment.

tl;dr something we're doing in Runner - I'm guessing some of the ioctl junk but not really sure - is messing with shell backgrounding such that the process has to be foregrounded again to continue execution.

It only occurs when a real pty is attached; headless processes (like the kind that used to break before #425's main issue was fixed) don't exhibit it. Which is why I'm guessing it's to do with the "passes isatty/has_fileno tests" code branches.

Issue started around 0.12.x in case that helps us narrow it down.

bitprophet commented 7 years ago

@tuukkamustonen - what exact OS, shell, and versions of each are you seeing this bug under? I'm assuming it'll affect most/all POSIX shells, but you never know...

tuukkamustonen commented 7 years ago

I'm running:

Also tried zsh (5.0.2) and the result is similar.

bitprophet commented 7 years ago

Started trying to reproduce this locally...I can kinda do it but I actually get a nastier looking error when I try to fg after the backgrounded process complains via shell job control. This is the same even back on 0.16.3 (i.e. before the changes from #425).

» python backgrounding.py &
[1] 62549
sleeping...
» running command...

[1]  + suspended (tty output)  python backgrounding.py
» fg
[1]  + continued  python backgrounding.py
jforcier
Traceback (most recent call last):
  File "backgrounding.py", line 9, in <module>
    run('whoami')
  File "/Users/jforcier/Code/oss/invoke/invoke/__init__.py", line 38, in run
    return Context().run(command, **kwargs)
  File "/Users/jforcier/Code/oss/invoke/invoke/context.py", line 78, in run
    return runner_class(context=self).run(command, **kwargs)
  File "/Users/jforcier/Code/oss/invoke/invoke/runners.py", line 259, in run
    return self._run_body(command, **kwargs)
  File "/Users/jforcier/Code/oss/invoke/invoke/runners.py", line 353, in _run_body
    raise ThreadException(thread_exceptions)
invoke.exceptions.ThreadException:
Saw 1 exceptions within threads (error):

Thread args: {'kwargs': {'echo': None,
            'input_': <open file '<stdin>', mode 'r' at 0x10dc640c0>,
            'output': <open file '<stdout>', mode 'w' at 0x10dc64150>},
 'target': <bound method Local.handle_stdin of <invoke.runners.Local object at 0x10e515c90>>}

Traceback (most recent call last):

  File "/Users/jforcier/Code/oss/invoke/invoke/util.py", line 156, in run
    super(ExceptionHandlingThread, self).run()

  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)

  File "/Users/jforcier/Code/oss/invoke/invoke/runners.py", line 630, in handle_stdin
    with character_buffered(input_):

  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()

  File "/Users/jforcier/Code/oss/invoke/invoke/platform.py", line 119, in character_buffered
    tty.setcbreak(stream)

  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/tty.py", line 36, in setcbreak
    tcsetattr(fd, when, mode)

error: (4, 'Interrupted system call')

At a glance, my shell (zsh 5.2 on OS X Sierra; it's the system zsh, not from brew) is giving us a tiny bit more info than Tuukka's, with the suspended (tty output) as explanation for the job control state change.

And well, yea - run('whoami') would by default be mirroring the subprocess' stdout/err to the controlling TTY's stdout/err. So at one level maybe this isn't even technically a bug, but "backgrounding something trying to write to stdout doesn't work well generally"? But I suspect that's not the end of it by any means.

There's a few ways I can go from here, brainstorming:

bitprophet commented 7 years ago

At a quick glance the Interrupted system call is symbolic of receiving an EINTR event from, I'm betting, one of our read or write calls. Not sure offhand whether it would've been interrupted while backgrounded (and we then only see the traceback upon foregrounding) or if the act of foregrounding is the trigger. (See note above about needing to grok all this more.)

Definitely need more instrumentation to log file to figure out exactly what's happening from the perspective of the Invoke process during this particular flow.

bitprophet commented 7 years ago

On my Trusty VM (and zsh 5.0.2) I get the foregrounding behavior, again with suspended (tty output), but as with Tuukka I do not get an EINTR related traceback; instead the foregrounded process just continues happily.

(As an aside, unless it ends up being fixed by an overall fix, I will leave that OS X specific issue as another issue, as I expect this use case to be chiefly encountered on Linux-oriented build systems...)


Starting to poke more:

bitprophet commented 7 years ago
bitprophet commented 7 years ago

Note: while this was obvious from the messages my shell has been sending me, it's laid out very clearly in man zshall:

A job being run in the background will suspend if it tries to read from the terminal.

So unless we can find a useful test to detect backgrounding & pre-emptively avoid touching stdin, there may be no fix for this besides documentation/education. I.e. recommend that users run backgrounded processes with an explicitly disassociated stdin; because as expected, invoke-thing < /dev/null & backgrounds fine and behaves exactly like the pre-0.12 use case, printing everything happily and never being paused.

bitprophet commented 7 years ago

Rummaged around on the stdin obj and in the os module but didn't see anything clearly helpful...decided to google before rummaging around harder in eg termios, and this SO post (because there's always an SO post...) shows that os.tcgetpgrp may be the key:

Sure enough, a basic test:

» python -c "import os; print(os.getpgrp()); import sys; print(os.tcgetpgrp(sys.stdin.fileno()))"
8848
8848
» python -c "import os; print(os.getpgrp()); import sys; print(os.tcgetpgrp(sys.stdin.fileno()))" &
[1] 8881
8881
1338
[1]  + done       python -c

Note how the foreground test has same PGID, but background test shows stdin is attached to PID 1338 (which is the zsh process this is all running in).

Time to see if this can be put into practice inside Invoke...

bitprophet commented 7 years ago

Yup, pretty easy, seems to work as advertised, and now things work "correctly" again. Also, kind of expectedly, it fixes the fully-disowned use case as well.

I gotta figure out how to test this, it's an extremely persnickety thing to test automatically, esp given we like to run our tests via ourselves...which naturally run in a headless state...

bitprophet commented 7 years ago

@tuukkamustonen In the meantime I'd love a confirmation that the commit showing up above, also fixes this in your real world CI server use case, and not just this base case!

tuukkamustonen commented 7 years ago

That sounded like a fun trip! :)

I can confirm your magic doing the trick - background tasks no longer halt in my local bash. No change in behavior in Jenkins (because it was already working there).

Also tried sub-shell/background invocation in Jenkins and it worked. I did not run the exact same commands as earlier (because I switched to GNU parallel for concurrent execution), but I am not doing anything special there so I think your fix just works.

(Bash and zsh seem to report the completion of a background command in a slightly different way, but yeah, that's just difference in shells, not something related to invoke.)

bitprophet commented 7 years ago

Awesome, glad to hear it!

Just gotta test this...I tried my hand at an integration level test yesterday but it was hard, as expected, due to not having a real terminal handy (which is compounded on Travis, to boot). Think I will just put in a unit-style test instead; less than ideal, but eh.