stripe / subprocess

A port of Python's subprocess module to Ruby
MIT License
209 stars 17 forks source link

Handle non-ASCII output from commands #20

Closed metcalf closed 8 years ago

metcalf commented 8 years ago

Subprocess always returns output strings encoded as ASCII-8BIT even if the IO uses a different encoding. read_nonblock really can't do anything else since it's reading raw bytes and might split up Unicode characters. Instead, we need to set the encoding of the complete string after we're done receiving it. external_encoding isn't readable on closed IOs so we have to read that at the beginning of communicate.

metcalf commented 8 years ago

r? @zenazn

zenazn commented 8 years ago

LGTM

metcalf commented 8 years ago

I'll deploy this internally but I don't have Rubygems bits so I won't deploy publicly.

doy commented 8 years ago

Am I missing something, or is this pretty backwards-incompatible? What would happen to someone currently using something like check_output(['gzip', whatever])?

metcalf commented 8 years ago

Hmm, I think you're right that would break but I think I'd rather break that than have Subprocess perpetually mishandle non-ASCII character sets by default. I don't really see a way of fixing that without introducing that incompatibility. This won't be a problem if you pass in your own pipe/file for stdout with the correct encoding so it's still very much possible to use Subprocess for binary things.

doy commented 8 years ago

Yeah, I'm not saying that this behavior is wrong, I'm just concerned because we have a looooot of untested subprocess-using code in places like puppet-config - I'm fairly sure that this will break artifact generation in Jenkins, for instance (which does a similar thing to above). At the very least we should bump the version to 2.0 since this is a breaking API change in a public open source module, but I think we'll also need to be very careful with the rollout (puppet-config does ensure => latest a lot when installing global gems, among other things).

Alternately, we could require the encoding to be specified explicitly if it's not binary? It wouldn't be great, but it would avoid breaking things.

metcalf commented 8 years ago

@doy I wrote up a little test script that reproduces the behavior of our build script and confirmed that this change does not break it:

# coding: utf-8
require 'rubygems'

gem 'subprocess', '1.3.2'

require 'subprocess'
require 'fileutils'
require 'tmpdir'

CONTENTS = "I'm a little teåpot"

Dir.mktmpdir do |tmpdir|
  indir = File.join(tmpdir, 'input')
  outdir = File.join(tmpdir, 'output')
  outfile = File.join(tmpdir, 'output.tgz')

  FileUtils.mkdir_p(indir)
  FileUtils.mkdir_p(outdir)
  File.write(File.join(indir, 'myfile'), CONTENTS)

  pigz = Subprocess.popen(
    ['pigz'],
    stdout: File.open(outfile, 'w'),
    stdin: Subprocess::PIPE
  )
  tar = Subprocess.popen(
    ['tar', '-c', '-C', indir, '--exclude=.git', '.'],
    stdout: pigz.stdin
  )
  pigz.stdin.close

  [pigz, tar].each do |p|
    unless p.wait.success?
      raise Subprocess::NonZeroExit.new(p.command, p.status)
    end
  end

  Subprocess.check_call(
    %W{tar -xz -C #{outdir} -f #{outfile}},
  )

  puts File.read(File.join(outdir, 'myfile'))
end
metcalf commented 8 years ago

I also confirmed that this works even if you don't pipe things directly from one process to another but read them in memory in Ruby first:

# coding: utf-8
require 'rubygems'

gem 'subprocess', '1.3.2'

require 'subprocess'
require 'fileutils'
require 'tmpdir'

CONTENTS = "I'm a little teåpot"

Dir.mktmpdir do |tmpdir|
  indir = File.join(tmpdir, 'input')
  outdir = File.join(tmpdir, 'output')
  outfile = File.join(tmpdir, 'output.tgz')

  FileUtils.mkdir_p(indir)
  FileUtils.mkdir_p(outdir)
  File.write(File.join(indir, 'myfile'), CONTENTS)

  tarred = Subprocess.check_output(
    ['tar', '-c', '-C', indir, '--exclude=.git', '.'],
    stdout: Subprocess::PIPE
  )

  Subprocess.check_call(
    ['pigz'],
    stdout: Subprocess::PIPE,
    stdin: Subprocess::PIPE
  ) do |p|
    out, = p.communicate(tarred)
    File.write(outfile, out)
  end

  Subprocess.check_call(
    %W{tar -xz -C #{outdir} -f #{outfile}},
  )

  puts File.read(File.join(outdir, 'myfile'))
end
doy commented 8 years ago

Okay, that's a bit more reassuring. I still think that we should bump the version to 2.0.0 and be a bit careful about the rollout, but it's probably a reasonable thing to do.

metcalf commented 8 years ago

I'll let @zenazn weigh in on the version bump. Given that I can't repro any backwards incompatibility here, I don't think it's justified.

doy commented 8 years ago

Well, there are clearly things that will behave differently if given a utf8-encoded string of characters instead of the equivalent string of bytes - it's just that writing them to a file or writing them back into an input pipe of another subprocess aren't those things, since those will always be interpreted as bytes regardless. Depends on how strictly you care about following semantic versioning, I guess?