Don't treat WaitWritable on stdin as an error

andrew-stripe commented 7 years ago

We've had several reports of people running OS X getting an EAGAINWaitWritable error from Subprocess writing to stdin. Rebooting the machine seems to fix the issue. On a non-rebooted machine, I was able to verify that after writing 512 bytes to a new pipe, subsequent writes would throw an error even immediately after select indicated that the fd is writable:

>> data = "a" * 819;
>> r, w = IO.pipe;
>> IO.select([], [w])
[[], [#<IO:fd 11>], []]
>> w.write_nonblock(data)
512
>> IO.select([], [w])
[[], [#<IO:fd 11>], []]
>> w.write_nonblock(data)
<raises EAGAINWaitWritable>
>> IO.select([], [w])
[[], [#<IO:fd 11>], []]

I was also able to confirm that this diff fixes the observed issue and the Subprocess completes cleanly.

There's some precedent for making this change:

As far as I can tell, OpenSSH swallows EAGAIN and continues looping: https://github.com/openssh/openssh-portable/blob/ca04de83f210959ad2ed870a30ba1732c3ae00e3/packet.c#L2178-L2180
In another place, OpenSSH notes that Solaris has experienced a similar kernel bug: https://github.com/openssh/openssh-portable/blob/8d0578478586e283e751ca51e7b0690631da139a/clientloop.c#L780-L786
The POSIX standard states that it's acceptable for a write to return an error even if select indicates writability:

A descriptor shall be considered ready for reading when a call to an input function with O_NONBLOCK clear would not block, whether or not the function would transfer data successfully. (The function might return data, an end-of-file indication, or an error other than one indicating that it is blocked, and in each of these cases the descriptor shall be considered ready for reading.)

A descriptor shall be considered ready for writing when a call to an output function with O_NONBLOCK clear would not block, whether or not the function would transfer data successfully.

If the OS misbehaves and the process blocks forever, the invoking process will spinloop indefinitely. I think this is acceptable since it's already good practice to implement a timeout if you don't trust the child process not to block indefinitely.

andrew-stripe commented 7 years ago

r? @evan-stripe I ran tests locally and they passed, didn't want to wait for Travis to assign for review since getting OS X workers on Travis can take an eternity.

andrew-stripe commented 7 years ago

cc @doy-stripe @asf-stripe @dougbarth-stripe since y'all provided suggestions for tracking this down.

russell-stripe commented 7 years ago

There's some precedent for making this change

Note that this is also what the example in the Ruby docs suggests doing: (at http://ruby-doc.org/core-2.3.0/IO.html#method-c-select):

while 0 < string.bytesize
  begin
    written = io_like.write_nonblock(string)
  rescue IO::WaitReadable
    IO.select([io_like])
    retry
  rescue IO::WaitWritable
    IO.select(nil, [io_like])
    retry
  end
  string = string.byteslice(written..-1)
end

(They also rescue WaitReadable to in case of SSL renegotiation, which we don't need to worry about here.)

evan-stripe commented 7 years ago

I think I'm sufficiently convinced that the test failures over the weekend was either spurious or some sort of test runner issue, so LGTM

andrew-stripe commented 7 years ago

Yea, I've found the builds on this repo to be somewhat flaky :-/

andrew-stripe commented 7 years ago

Add a version bump to this PR as well.

stripe / subprocess

Don't treat WaitWritable on stdin as an error #22