Backgrounded Invoke processes halt execution on run()

bitprophet commented 7 years ago

See bottom half of this comment as well as this later comment.

tl;dr something we're doing in Runner - I'm guessing some of the ioctl junk but not really sure - is messing with shell backgrounding such that the process has to be foregrounded again to continue execution.

It only occurs when a real pty is attached; headless processes (like the kind that used to break before #425's main issue was fixed) don't exhibit it. Which is why I'm guessing it's to do with the "passes isatty/has_fileno tests" code branches.

Issue started around 0.12.x in case that helps us narrow it down.

bitprophet commented 7 years ago

@tuukkamustonen - what exact OS, shell, and versions of each are you seeing this bug under? I'm assuming it'll affect most/all POSIX shells, but you never know...

tuukkamustonen commented 7 years ago

I'm running:

Kubuntu 14.04 LTS (as VirtualBox guest).
GNU bash version 4.3.11(1)-release (x86_64-pc-linux-gnu).

Also tried zsh (5.0.2) and the result is similar.

bitprophet commented 7 years ago

Started trying to reproduce this locally...I can kinda do it but I actually get a nastier looking error when I try to fg after the backgrounded process complains via shell job control. This is the same even back on 0.16.3 (i.e. before the changes from #425).

» python backgrounding.py &
[1] 62549
sleeping...
» running command...

[1]  + suspended (tty output)  python backgrounding.py
» fg
[1]  + continued  python backgrounding.py
jforcier
Traceback (most recent call last):
  File "backgrounding.py", line 9, in <module>
    run('whoami')
  File "/Users/jforcier/Code/oss/invoke/invoke/__init__.py", line 38, in run
    return Context().run(command, **kwargs)
  File "/Users/jforcier/Code/oss/invoke/invoke/context.py", line 78, in run
    return runner_class(context=self).run(command, **kwargs)
  File "/Users/jforcier/Code/oss/invoke/invoke/runners.py", line 259, in run
    return self._run_body(command, **kwargs)
  File "/Users/jforcier/Code/oss/invoke/invoke/runners.py", line 353, in _run_body
    raise ThreadException(thread_exceptions)
invoke.exceptions.ThreadException:
Saw 1 exceptions within threads (error):

Thread args: {'kwargs': {'echo': None,
            'input_': <open file '<stdin>', mode 'r' at 0x10dc640c0>,
            'output': <open file '<stdout>', mode 'w' at 0x10dc64150>},
 'target': <bound method Local.handle_stdin of <invoke.runners.Local object at 0x10e515c90>>}

Traceback (most recent call last):

  File "/Users/jforcier/Code/oss/invoke/invoke/util.py", line 156, in run
    super(ExceptionHandlingThread, self).run()

  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)

  File "/Users/jforcier/Code/oss/invoke/invoke/runners.py", line 630, in handle_stdin
    with character_buffered(input_):

  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()

  File "/Users/jforcier/Code/oss/invoke/invoke/platform.py", line 119, in character_buffered
    tty.setcbreak(stream)

  File "/usr/local/var/pyenv/versions/2.7.12/lib/python2.7/tty.py", line 36, in setcbreak
    tcsetattr(fd, when, mode)

error: (4, 'Interrupted system call')

At a glance, my shell (zsh 5.2 on OS X Sierra; it's the system zsh, not from brew) is giving us a tiny bit more info than Tuukka's, with the suspended (tty output) as explanation for the job control state change.

And well, yea - run('whoami') would by default be mirroring the subprocess' stdout/err to the controlling TTY's stdout/err. So at one level maybe this isn't even technically a bug, but "backgrounding something trying to write to stdout doesn't work well generally"? But I suspect that's not the end of it by any means.

There's a few ways I can go from here, brainstorming:

[ ] Learn more about shell job control details, because I don't actually know a ton about them...
[ ] Double check if/how behavior differs in a subshell, as per Tuukka's very first comment on this topic in the other ticket. I.e. (inv blah)& wait as opposed to simply inv blah &
[ ] Double check how a Linux environment changes things
[ ] Figure out the traceback I got; feels like it's because stdin is becoming disassociated by the backgrounding and we're continuing to miss some spots where we ought to detect this and behave differently (this was my original guess at the problem: when stdin isn't usefully attached to a tty, we should not be trying to do things like tty.setcbreak on them...then again, that presumably occurred on the foregrounding where the TTY is present...)
- And perhaps the true core issue is that the stream state may change over time - our tests here start out backgrounded but what if somebody started in the foreground and then wanted to ^Z + bg? Probably the exact same problem.
- That said I think most of our stream-touching bits perform their capability tests as late as possible, so this may not be that important.

bitprophet commented 7 years ago

At a quick glance the Interrupted system call is symbolic of receiving an EINTR event from, I'm betting, one of our read or write calls. Not sure offhand whether it would've been interrupted while backgrounded (and we then only see the traceback upon foregrounding) or if the act of foregrounding is the trigger. (See note above about needing to grok all this more.)

Definitely need more instrumentation to log file to figure out exactly what's happening from the perspective of the Invoke process during this particular flow.

bitprophet commented 7 years ago

On my Trusty VM (and zsh 5.0.2) I get the foregrounding behavior, again with suspended (tty output), but as with Tuukka I do not get an EINTR related traceback; instead the foregrounded process just continues happily.

(As an aside, unless it ends up being fixed by an overall fix, I will leave that OS X specific issue as another issue, as I expect this use case to be chiefly encountered on Linux-oriented build systems...)

Starting to poke more:

Adding hide=True, meaning no stream mirroring, to the run calls makes no difference, so it's likely that this is related to our manipulation of stdin and not our writes to stdout/`stderr
Confirmed that 0.11.1 doesn't have the issue either way and that 0.12.0 does; 0.12 made lots of changes to stdout/err/in mirroring (including forcing stdin to be bytewise) in order to implement autoresponding, so...yea. it's somewhere in that big pile of stuff.
As suspected, when I add a lot of debugging, the last thing to occur prior to the pausing-to-be-foregrounded is setting stdin to be character buffered.
That's in platform.character_buffered, and after a presumably successful not isatty(stream) test, meaning the stream appears to be a TTY.
- So even if the shell has attempted to disassociate somehow, our test about tty-ness is still passing. Specifically, going by my output, stream.isatty() is returning True, where stream is the usual <open file '<stdin>', ...> (aka a Python 2 sys.stdin).
In testing, either of the tty.setcbreak or tty.tcgetattr calls is sufficient to make the shell get mad about things. Not really surprising though.
So the question that occurs to me is "can we even tell that sys.stdin is backgrounded?" which brings me right back to "What exactly is backgrounding doing in the first place?"

bitprophet commented 7 years ago

The behavior is nearly identical when we 'disown' the process with &! (which is actually a mite odd, as i still see the output, so clearly the tty attachment is identical, simply the job stops being controllable via job control. It remains a child of zsh too, instead of becoming a child of init as I was expecting. Meh.)
With code all set up regularly, the hang happens at setcbreak, and prior to that point, both character_buffered, isatty(stdin) and has_fileno(stdin) - our current tests for tty-like things - say True. So as I suspected they won't help as-is.
- This is the same if we comment setcbreak out; we just hang on tcsetattr (not tcgetattr, btw) and even at that point, stdin passes our tests.

bitprophet commented 7 years ago

Note: while this was obvious from the messages my shell has been sending me, it's laid out very clearly in man zshall:

A job being run in the background will suspend if it tries to read from the terminal.

So unless we can find a useful test to detect backgrounding & pre-emptively avoid touching stdin, there may be no fix for this besides documentation/education. I.e. recommend that users run backgrounded processes with an explicitly disassociated stdin; because as expected, invoke-thing < /dev/null & backgrounds fine and behaves exactly like the pre-0.12 use case, printing everything happily and never being paused.

bitprophet commented 7 years ago

Rummaged around on the stdin obj and in the os module but didn't see anything clearly helpful...decided to google before rummaging around harder in eg termios, and this SO post (because there's always an SO post...) shows that os.tcgetpgrp may be the key:

It obtains the process group ID associated with a given terminal FD
Similarly to how os.getpgrp does the same for the current Python process
When those are different, it means that the TTY has returned to the control of some other process besides the Python one - typically, of course, the invoking shell.

Sure enough, a basic test:

» python -c "import os; print(os.getpgrp()); import sys; print(os.tcgetpgrp(sys.stdin.fileno()))"
8848
8848
» python -c "import os; print(os.getpgrp()); import sys; print(os.tcgetpgrp(sys.stdin.fileno()))" &
[1] 8881
8881
1338
[1]  + done       python -c

Note how the foreground test has same PGID, but background test shows stdin is attached to PID 1338 (which is the zsh process this is all running in).

Time to see if this can be put into practice inside Invoke...

bitprophet commented 7 years ago

Yup, pretty easy, seems to work as advertised, and now things work "correctly" again. Also, kind of expectedly, it fixes the fully-disowned use case as well.

I gotta figure out how to test this, it's an extremely persnickety thing to test automatically, esp given we like to run our tests via ourselves...which naturally run in a headless state...

bitprophet commented 7 years ago

@tuukkamustonen In the meantime I'd love a confirmation that the commit showing up above, also fixes this in your real world CI server use case, and not just this base case!

tuukkamustonen commented 7 years ago

That sounded like a fun trip! :)

I can confirm your magic doing the trick - background tasks no longer halt in my local bash. No change in behavior in Jenkins (because it was already working there).

Also tried sub-shell/background invocation in Jenkins and it worked. I did not run the exact same commands as earlier (because I switched to GNU parallel for concurrent execution), but I am not doing anything special there so I think your fix just works.

(Bash and zsh seem to report the completion of a background command in a slightly different way, but yeah, that's just difference in shells, not something related to invoke.)

bitprophet commented 7 years ago

Awesome, glad to hear it!

Just gotta test this...I tried my hand at an integration level test yesterday but it was hard, as expected, due to not having a real terminal handy (which is compounded on Travis, to boot). Think I will just put in a unit-style test instead; less than ideal, but eh.

pyinvoke / invoke

Backgrounded Invoke processes halt execution on run() #439