Handle encoding of output from Process#communicate identically whether a block is provided or not.
Context
Process#communicate depends on Ruby's IO#read_nonblock via Process#drain_fd. Strings returned from read_nonblock are handled as ASCII-8BIT to prevent data loss. (For a demonstration of this behavior, download and run this snippet.) If output from a subprocess has a different encoding, this can lead to decoding errors.
This results in an unexpected (at least for me) discrepancy between the encoding settings of output from Process#communicate, depending on the form of the method invoked. Any consumer that depends on output from partially completed commands must provide a block (see PR #36), and so becomes subject to potential decoding errors.
Design
This branch resolves the issue by re-applying captured encoding settings before yielding output to the provided block.
One consideration: as written here, encoding settings will be re-applied every time the method attempts to read output. As noted in this comment on PR #36, this step can be omitted when no output is present on STDOUT or STDERR. Since performing this step on an empty string is a no-op, and the previous implementation does not perform this check before re-applying encoding settings, the proposed changes favor readability.
Testing
This branch modifies the "multiwrite script" in the test file to include non-ASCII characters, and updates dependent assertions. This modification reveals the potential for decoding errors when providing a block to Process#communicate.
Note that the modified tests fail without the included changes, but pass with the included changes.
Purpose
Handle encoding of output from
Process#communicate
identically whether a block is provided or not.Context
Process#communicate
depends on Ruby'sIO#read_nonblock
viaProcess#drain_fd
. Strings returned fromread_nonblock
are handled asASCII-8BIT
to prevent data loss. (For a demonstration of this behavior, download and run this snippet.) If output from a subprocess has a different encoding, this can lead to decoding errors.When not provided a block,
Process#communicate
handles this by capturing encoding settings at the beginning of the method and re-applying these settings at the end of the method. However, when provided a block this method yields output before re-applying these settings, so the output still retains theASCII-8BIT
settings applied byread_nonblock
.This results in an unexpected (at least for me) discrepancy between the encoding settings of output from
Process#communicate
, depending on the form of the method invoked. Any consumer that depends on output from partially completed commands must provide a block (see PR #36), and so becomes subject to potential decoding errors.Design
This branch resolves the issue by re-applying captured encoding settings before yielding output to the provided block.
One consideration: as written here, encoding settings will be re-applied every time the method attempts to read output. As noted in this comment on PR #36, this step can be omitted when no output is present on
STDOUT
orSTDERR
. Since performing this step on an empty string is a no-op, and the previous implementation does not perform this check before re-applying encoding settings, the proposed changes favor readability.Testing
This branch modifies the "multiwrite script" in the test file to include non-ASCII characters, and updates dependent assertions. This modification reveals the potential for decoding errors when providing a block to
Process#communicate
.Note that the modified tests fail without the included changes, but pass with the included changes.
"你好 世界" is Google's translation of "Hello World" into Chinese.