python / cpython

The Python programming language
https://www.python.org

subprocess seems to use local encoding and give no choice #50385

Closed a09e2537-b6b9-4978-9d1d-78b1db358cbc closed 8 years ago

a09e2537-b6b9-4978-9d1d-78b1db358cbc commented 15 years ago
BPO 6135
Nosy @gpshead, @amauryfa, @ncoghlan, @pitrou, @vstinner, @mark-summerfield, @merwok, @bitdancer, @mightyiam, @andyclegg, @cjerdonek, @vadmium, @eryksun, @zooba, @davispuh
PRs
  • python/cpython#5564
  • python/cpython#5572
  • python/cpython#5573
Files
  • subprocess.patch: Add encoding and errors to subprocess.Popen. Based against trunk.
  • subprocess3.patch: Add encoding and errors to subprocess.Popen. Based against py3k.
  • test_subprocess3.py.patch: unit tests for the encoding parameter
  • subProcessTest.py: Workaround example. Calls itself as a sub-process.
  • 6135_1.patch
  • 6135_2.patch
  • 6135_3.patch
  • 6135_4.patch
  • 6135_5.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

        assignee = 'https://github.com/zooba'
        closed_at =
        created_at =
        labels = ['type-feature', 'library']
        title = 'subprocess seems to use local encoding and give no choice'
        updated_at =
        user = 'https://github.com/mark-summerfield'

    bugs.python.org fields:

        activity =
        actor = 'gregory.p.smith'
        assignee = 'steve.dower'
        closed = True
        closed_date =
        closer = 'steve.dower'
        components = ['Library (Lib)']
        creation =
        creator = 'mark'
        dependencies = []
        files = ['14293', '14294', '14295', '28760', '44386', '44389', '44398', '44404', '44408']
        hgrepos = []
        issue_num = 6135
        keywords = ['patch']
        message_count = 66.0
        messages = ['88466', '89094', '89293', '89322', '89325', '89332', '89333', '97090', '97092', '97093', '111181', '111466', '123020', '123024', '148025', '148066', '148484', '148494', '148495', '148496', '148498', '168191', '168213', '180157', '265785', '265793', '265812', '265822', '274467', '274468', '274469', '274481', '274489', '274492', '274497', '274509', '274510', '274517', '274518', '274519', '274561', '274562', '274566', '274574', '274642', '274644', '274646', '274648', '274649', '274651', '274653', '274665', '274699', '274736', '274737', '282291', '282294', '282298', '304049', '304060', '304118', '304119', '304126', '311756', '311757', '311760']
        nosy_count = 20.0
        nosy_names = ['gregory.p.smith', 'amaury.forgeotdarc', 'ncoghlan', 'pitrou', 'vstinner', 'mark', 'eric.araujo', 'segfaulthunter', 'Arfrever', 'r.david.murray', 'srid', 'mightyiam', 'andrewclegg', 'chris.jerdonek', 'python-dev', 'martin.panter', 'eryksun', 'steve.dower', 'berwin22', 'davispuh']
        pr_nums = ['5564', '5572', '5573']
        priority = 'normal'
        resolution = 'fixed'
        stage = 'resolved'
        status = 'closed'
        superseder = None
        type = 'enhancement'
        url = 'https://bugs.python.org/issue6135'
        versions = ['Python 3.6']

    a09e2537-b6b9-4978-9d1d-78b1db358cbc commented 15 years ago

    When I start a process with subprocess.Popen() and pipe the stdin and stdout, it always seems to use the local 8-bit encoding.

    I tried setting process.stdin.encoding = "utf8" and the same for stdout (where process is the subprocess object), but to no avail.

    I also tried using shell=True since on Mac, Terminal.app is fine with Unicode, but that didn't work.

    So basically, I have programs that output Unicode and run fine on the Mac terminal, but that cannot be executed by subprocess because subprocess uses the mac_roman encoding instead of Unicode.

    I wish it were possible to specify the stdin and stdout encoding that is used; then I could use the same one on all platforms. (But perhaps it is possible, and I just haven't figured out how?)
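    For illustration, a minimal sketch of the failure mode being described (child.py is a hypothetical script that prints non-ASCII text; mac_roman stands in for whatever the locale encoding happens to be):

        import subprocess

        # When the child's stdout is a pipe rather than a terminal, a Python
        # child encodes sys.stdout with the locale's preferred encoding.
        p = subprocess.Popen(["python", "child.py"], stdout=subprocess.PIPE)
        raw = p.stdout.read()           # bytes in the child's locale encoding
        text = raw.decode("mac_roman")  # the parent has to know/guess that encoding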

    5bf59da1-86ec-4384-896c-68455c40bd5f commented 15 years ago

    Related discussion thread: https://answers.launchpad.net/bzr/+question/63601

    amauryfa commented 15 years ago

    I propose to add two parameters (encoding, error) to the subprocess.Popen function.

    I'll try to come up with a patch.
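    A sketch of what the proposed API might look like (hypothetical at this point in the thread; "some-command" is a placeholder):

        from subprocess import Popen, PIPE

        # Proposed: decode the pipes with a caller-chosen codec instead of
        # the locale encoding.
        p = Popen(["some-command"], stdin=PIPE, stdout=PIPE, encoding="utf-8")
        print(p.stdout.read())  # str, decoded as UTF-8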

    d20267a8-d9fd-4a0f-adc7-c50e8ee29c82 commented 15 years ago

    I wrote a patch to add encoding and error to subprocess.Popen in Python 2.7 (trunk).

    d20267a8-d9fd-4a0f-adc7-c50e8ee29c82 commented 15 years ago

    Cosmetic update.

    pitrou commented 15 years ago

    Two things:

    1. The argument should be called errors for consistency with open() and TextIOWrapper(), not error
    2. You should add some unit tests.
    d20267a8-d9fd-4a0f-adc7-c50e8ee29c82 commented 15 years ago

    Should we also cover the unusual case where stdout, stderr and stdin have different encodings? Because right now we are assuming they are all the same.

    a09e2537-b6b9-4978-9d1d-78b1db358cbc commented 14 years ago

    I agree with Florian Mayer that the encoding handling should be stream-specific. You could easily be reading the stdout of some third party program that uses, say, latin1, but want to do your own output in, say, utf-8.

    One solution builds on what Amaury Forgeot d'Arc has done (i.e., the encoding and errors parameters) by allowing those parameters to accept either a single string (as now) or a dict with keys 'stdin', 'stdout', and 'stderr'. Of course, the client might not specify all the dict's keys, in which case the missing ones would use the normal default (local 8-bit etc.); a sketch follows below.
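    A sketch of that suggestion (hypothetical; this dict-valued form was never adopted):

        from subprocess import Popen, PIPE

        # Hypothetical API sketch: per-stream encodings via a dict; streams
        # not named in the dict would fall back to the default encoding.
        p = Popen(["some-command"], stdin=PIPE, stdout=PIPE,
                  encoding={'stdin': 'latin-1', 'stdout': 'utf-8'})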

    amauryfa commented 14 years ago

    I don't understand. How is the subprocess stdout related to the main program output? Stream-specific encoding could be useful for subprocesses that expect latin-1 from stdin but write utf-8 to stdout. I'm not sure we should support this.

    a09e2537-b6b9-4978-9d1d-78b1db358cbc commented 14 years ago

    > I don't understand. How is the subprocess stdout related to the main program output? Stream-specific encoding could be useful for subprocesses that expect latin-1 from stdin but write utf-8 to stdout. I'm not sure we should support this.

    Yes, you're right.

    (What I had in mind was a scenario where you read one process's stdout and wrote to another process's stdin; but of course using your errors & encoding arguments this will work because there'll be two separate process objects each of which can have its encoding and errors set separately.)

    83d2e70e-e599-4a04-b820-3814bbdb9bef commented 14 years ago

    Ran the new unit tests before and after patching subprocess on Windows Vista against a 3.1 debug maintenance build; all OK apart from this at the end of the latter run:

        File "test\test_subprocess.py", line 568, in test_encoded_stderr
          self.assertEqual(p.stderr.read(), send)
        AssertionError: 'ï[32943 refs]\r\n' != 'ï'

    I'm sure I've seen a reference to this somewhere; can anyone remember where?

    83d2e70e-e599-4a04-b820-3814bbdb9bef commented 14 years ago

    In 2.7 and py3k, test_subprocess has a class BaseTestCase which has an assertStderrEqual method. These don't exist in 3.1. I believe the py3k code will need to be backported to 3.1. Can this be done on this issue, or do we need a new one to keep things clean?

    vstinner commented 14 years ago

    About the topic:

    > subprocess seems to use local 8-bit encoding and gives no choice

    I don't understand that: by default, Python 2 and Python 3 use byte strings, so there is no encoding (nor error handler).

    I don't see how you can get unicode from a process using only subprocess. But with Python 3, you can get unicode if you set the universal_newlines option to True.

    So for Python 2, it's a new feature (get unicode), and for Python 3, it's a new option to specify the encoding. The title should be changed to something like "subprocess: add an option to specify stdin, stdout and/or stderr encoding and errors" and the type should be changed to "feature request".

    Or am I completely wrong?
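    For reference, a minimal sketch of the Python 3 behaviour described above (assuming a POSIX ls command): universal_newlines=True switches the pipes to text mode, decoded with the locale's preferred encoding.

        from subprocess import Popen, PIPE

        # Text-mode pipes: decoded with locale.getpreferredencoding(False).
        p = Popen(["ls"], stdout=PIPE, universal_newlines=True)
        out = p.stdout.read()  # str, not bytes
        p.wait()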

    vstinner commented 14 years ago

    > ... it always seems to use the local 8-bit encoding

    The locale encoding is not necessarily an 8-bit encoding; it can be a multibyte encoding like... UTF-8 :-)

    --

    subprocess.patch: You should maybe use io.open(process.stdout.fileno(), encoding=..., errors=...) instead of codecs.getreader/getwriter. The code will be closer to Python 3. I think the io module has better Unicode support than the codecs reader/writer objects, and a nicer API. See: http://bugs.python.org/issue8796#msg106339
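    A sketch of the two wrapping styles being compared (process stands for an existing Popen object with a stdout pipe):

        import codecs
        import io

        # codecs-style wrapping, as in the original patch:
        reader = codecs.getreader("utf-8")(process.stdout)

        # io-style wrapping, as suggested here; io.TextIOWrapper(process.stdout,
        # encoding="utf-8") is a close equivalent that reuses the file object:
        reader = io.open(process.stdout.fileno(), encoding="utf-8", errors="strict")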

    --

    > ... allowing [encoding and errors] to accept either a single string (as now), or a dict with keys 'stdin', 'stdout', 'stderr'

    I like this idea. But what about other TextIOWrapper (or other file classes) options: buffer size, newline, line_buffering, etc.?

    Why not using a dict for existing stdin, stdout and stderr arguments? Dummy example:

        process = Popen(
            command,
            stdin={'file': PIPE, 'encoding': 'iso-8859-1', 'newline': False},
            stdout={'file': PIPE, 'encoding': 'utf-8', 'buffering': 0, 'line_buffering': False},
            ...)

    If stdin, stdout or stderr is a dict: the default value of its 'file' key can be set to PIPE. I don't think that it's possible to choose the encoding, buffer size, or anything else if stdin, stdout or stderr is not a pipe.

    With this solution, you cannot specify the encoding for stdin, stdout and stderr at once. You have at least to repeat the encoding for stdin and stdout (and use stderr=STDOUT).

    --

    I still hesitate to accept this feature request. Is it really necessary to add extra arguments for TextIOWrapper? Can't developers create their own TextIOWrapper objects with all the interesting options?

    In Python 3, being able to specify the stdout encoding is an interesting feature. Controlling the buffer size is also an interesting option.

    My problem is maybe the usage of a dict to specify various options. I'm not sure it is extensible enough to support future needs.

    ncoghlan commented 13 years ago

    I discovered this same problem recently when updating the subprocess docs, and also in working on the improved shell invocation support I am proposing for 3.3 (bpo-13238).

    I initially posted an earlier variant of this suggestion as a new issue (bpo-13442), but Victor redirected me here.

    Firstly, I don't think it makes any sense to set encoding information globally for the Popen object. As a simple example, consider using Python to write a test suite for the iconv command line tool: there's only one Popen instance (for the iconv call), but different encodings for stdin and stdout.

    Really, we want to be able to make full use of Python 3's layered I/O model, but we want the subprocess pipe instances to be slotted in at the lowest layer rather than creating them ourselves.

    The easiest way to do that is to have a separate class that specifies the additional options for pipe creation and does the wrapping:

        import io

        class TextPipe:
            """Pipe marker carrying TextIOWrapper options for a subprocess pipe."""
            def __init__(self, *args, **kwds):
                self.args = args
                self.kwds = kwds

            def wrap_pipe(self, pipe):
                # Wrap the raw binary pipe in a text layer using the stored options.
                return io.TextIOWrapper(pipe, *self.args, **self.kwds)

    The stream creation process would then include a new "wrap = getattr(stream_arg, 'wrap_pipe', None)" check that is similar to the existing check for subprocess.PIPE, but invokes the method to wrap the pipe after creating it.

    So to read UTF-8 encoded data from a subprocess, you could just do:

        data = check_stdout(cmd, stdout=TextPipe('utf-8'), stderr=STDOUT)
    pitrou commented 13 years ago

    > Firstly, I don't think it makes any sense to set encoding information globally for the Popen object. As a simple example, consider using Python to write a test suite for the iconv command line tool: there's only one Popen instance (for the iconv call), but different encodings for stdin and stdout.

    Isn't that the exception rather than the rule? I think it actually makes sense, in at least 99.83% of cases ;-), to have a common encoding setting for all streams. (I'm not sure about the "errors" setting, though: should we use strict for stdin/stdout and backslashreplace for stderr, as the interpreter does?)

    Perhaps the common case should be made extra easy.

    merwok commented 13 years ago

    Is subprocess affected by PYTHONIOENCODING?

    vstinner commented 13 years ago

    > Is subprocess affected by PYTHONIOENCODING?

    Yes, like any Python process.

    merwok commented 13 years ago

    So the users can control the encoding, and this is a doc bug.

    bitdancer commented 13 years ago

    If you decide this is only a doc bug, please see also related bpo-12832.

    pitrou commented 13 years ago

    > So the users can control the encoding, and this is a doc bug.

    Not really. People can control the encoding in the child process (and only if it's a Python 3 process of course). They can't control the encoding in the parent's subprocess pipes and that's what the request (& patch) is about.
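    A sketch of that distinction (assuming a Python 3 child): PYTHONIOENCODING controls what the child writes, while the parent must still decode its side of the pipe itself.

        import os
        import subprocess

        env = os.environ.copy()
        env["PYTHONIOENCODING"] = "utf-8"  # controls the *child's* sys.stdout

        p = subprocess.Popen(["python3", "-c", "print('©')"],
                             stdout=subprocess.PIPE, env=env)
        out, _ = p.communicate()
        text = out.decode("utf-8")  # the *parent* must still decode the pipe bytes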

    cjerdonek commented 12 years ago

    > > only one Popen instance (for the iconv call), but different encodings for stdin and stdout.
    >
    > Isn't that the exception rather than the rule? I think it actually makes sense, in at least 99.83% of cases ;-), to have a common encoding setting for all streams.

    FWIW, I recently encountered a scenario (albeit in a test situation) where the ability to set different encodings for stdout and stderr would have been useful to me. It was while creating a test case for bpo-15595. I was changing the locale encoding for stdout, but I also wanted to leave it unchanged for stderr because there didn't seem to be a way to control the encoding that the child used for stderr.

    cjerdonek commented 12 years ago

    Following up on my previous comment: bpo-15648 shows the case where I was able to change the encoding used for stdout in the child process, but not for stderr (handling that would require Popen to support two encodings).

    ec0bf251-ec2a-454a-9ad2-e266525bc495 commented 11 years ago

    I've found a workaround: set the PYTHONIOENCODING environment variable for the child process.

        import os
        import shlex
        import subprocess

        my_env = os.environ.copy()  # copy, so the parent's environment is untouched
        my_env['PYTHONIOENCODING'] = 'utf-8'
        p = subprocess.Popen(shlex.split(command), stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE, stdin=subprocess.PIPE,
                             env=my_env)

    I've attached an example script for testing. It calls itself recursively 10 times. Please note the 'fix' variable.

    vadmium commented 8 years ago

    This seems like a request for extra feature(s), rather than a bug report.

    FWIW I agree with Victor's hesitation in https://bugs.python.org/issue6135#msg123024. I doubt it is worth adding special support to have multiple pipes that all use text mode (universal_newlines=True) but with different encoding settings. If you really want this, just add an external TextIOWrapper, or encode the strings yourself:

        import io
        from subprocess import Popen, PIPE

        process = Popen(command, stdin=PIPE, stdout=PIPE, ...)

        # Manually wrap with TextIOWrapper. Probably risks deadlocking in the
        # general case, due to buffering, so this is a bad idea IMO.
        input_writer = io.TextIOWrapper(process.stdin, "iso-8859-1")
        output_reader = io.TextIOWrapper(process.stdout, "utf-8")

        # Better: use communicate(), or a custom select() loop or similar:
        input_string = input_string.encode("iso-8859-1")
        [output_string, _] = process.communicate(input_string)
        output_string = output_string.decode("utf-8")

    Also I’m not enthusiastic about adding encoding etc parameters for a single PIPE scenario. What is wrong with leaving it in binary mode in the Popen call, and encoding, decoding, or using TextIOWrapper as a separate step?

    vadmium commented 8 years ago

    I added some review comments for Florian’s patches. Also, I suspect the code will not affect the text encoding when communicate() is used with multiple PIPE streams.

    ncoghlan commented 8 years ago

    Since this issue was first filed, the "opener" parameter was added to the open() builtin: https://docs.python.org/3/library/functions.html#open

    I mention that, as the problem there is similar to the problem here: wanting to construct a buffered text-mode interface, but with a different stream at the lowest layer (whatever "opener" returns for the builtin open(), the implicitly created pipes for the subprocess module).
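    For comparison, a small POSIX-flavoured example of the opener hook (cloexec_opener is just an illustrative name): the caller supplies the lowest layer as a raw file descriptor, and open() still builds the buffered and text layers on top.

        import os

        def cloexec_opener(path, flags):
            # Return a raw fd; open() wraps it in buffered/text layers.
            return os.open(path, flags | os.O_CLOEXEC)

        with open("example.txt", "w", opener=cloexec_opener) as f:
            f.write("hello\n")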

    To answer Martin's question about "Why not just wrap things manually?", the main issue with that is that it doesn't work for the check_output helper function - to get text out of that, currently your only option is to enable universal newlines and trust that the output is encoded as UTF-8 (which is fortunately often a valid assumption on modern systems).

    vadmium commented 8 years ago

    Here is a tested illustration of how I would suggest to manually handle encoding with check_output():

    >>> from subprocess import check_output
    >>> text_input = "©"
    >>> args = ("iconv", "--from-code", "ISO_8859-1", "--to-code", "UTF-8")
    >>> bytes_output = check_output(args, input=text_input.encode("iso-8859-1"))
    >>> bytes_output
    b'\xc2\xa9'
    >>> bytes_output.decode("utf-8")
    '©'

    If you wanted actual universal newline translation (which is more than plain text encoding), it would be more complicated, but still possible.
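    For the newline case, one possible sketch: feed the captured bytes through a TextIOWrapper, whose default newline=None applies universal-newline translation on read.

        import io

        # '\r\n' and '\r' in the output are translated to '\n'
        text = io.TextIOWrapper(io.BytesIO(bytes_output), encoding="utf-8").read()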

    But I am not hugely against adding new “encoding” and “errors” parameters. The reasons against are that I don’t see it as necessary, and there are already a lot of subprocess-specific options to wade through.

    zooba commented 8 years ago

    I'll prepare a patch for the following:

    EITHER

    Any thoughts?

    vstinner commented 8 years ago

    > add 'oem' encoding to make it easy to choose (but don't try and guess when it is used)

    I suggest opening a separate issue for that. By the way, you might also add "ansi"? See also the aliasmbcs() function of the site module.

    Note: I never liked that aliasmbcs() is implemented in Python and is "optional" (not done when using python3 -S).

    zooba commented 8 years ago

    'mbcs' is exactly equivalent to what 'ansi' would be, so that's just a matter of knowing the name. I'm okay with adding an alias for it though.

    zooba commented 8 years ago

    Initial patch attached - tests to follow.

    vstinner commented 8 years ago

    I'm not sure about the "advanced API" to specify an encoding per stream, or even change other parameters like buffering or newlines.

    I suggest starting with the least controversial part: add encoding and errors, and only accept a string.

    More patches can come later.

    zooba commented 8 years ago

    Added tests.

    zooba commented 8 years ago

    You may be right about leaving out the opener API. The only use of it right now is for separate encodings, but I don't know how valuable that is.

    I'll pull it out tomorrow and just leave the encoding parameter.

    vadmium commented 8 years ago

    It would be nice to see tests for getstatusoutput() and the errors parameter. Also, you need more error handling: e.g. if the encoding is unsupported, I think the internal pipe files won't be cleaned up.

    Also, should encoding=... or errors=... be an error if universal_newlines=False (the default)?

    ncoghlan commented 8 years ago

    A couple of high level questions:

    a. How do folks feel about providing a new "text" parameter to replace the cryptic "universal_newlines=True" that would explicitly be equivalent to "universal newlines with sys.getdefaultencoding()"?

    b. Given (a), what if the new "text" parameter also accepted a new "subprocess.TextConfig" object in addition to the base behaviour of doing a plain bool(text) check to activate the defaults?

    One of the biggest problems we have with subprocess.Popen is that it still attempts to use a flat configuration model for a complex operation involving 4-7 underlying elements (starting the subprocess; configuring stdin, stdout, and stderr; and potentially local counterparts for stdin/stdout/stderr if pipes are used), and that situation is unlikely to ever improve unless we start introducing some hierarchical elements to the configuration that better mirror the structure of the underlying system capabilities.
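    A sketch of what (b) might look like in use (subprocess.TextConfig is purely hypothetical and was never implemented; cmd is a placeholder argument list):

        import subprocess

        # Hypothetical API, never implemented:
        cfg = subprocess.TextConfig(encoding="utf-8", errors="replace", newline="\n")
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=cfg)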

    vstinner commented 8 years ago

    Steve:

    > You may be right about leaving out the opener API. The only use of it right now is for separate encodings, but I don't know how valuable that is.

    My proposal is: Popen(cmd, stdin={'encoding': 'oem'}, stdout={'encoding': 'ansi'})

    The dict would just be passed to TextIOWrapper, so you could set even more arguments (newline, line_buffering, etc.).

    But I still think that simple encoding + errors arguments should be added for the common case : Popen(cmd, encoding='utf8').

    You can combine options: Popen(cmd, stdin={'encoding': 'oem'}, stdout=subprocess.PIPE, stderr=subprocess.PIPE, encoding='ansi') makes stdout and stderr use the ANSI code page.

    vstinner commented 8 years ago

    > Also, should encoding=... or errors=... be an error if universal_newlines=False (the default)?

    Right. But if encoding or errors is used, universal_newlines should automatically be set to True.

    For example, I expect Unicode when writing:

    output = subprocess.run(cmd, stdout=subprocess.PIPE, encoding='utf8').stdout
    vstinner commented 8 years ago

    > a. How do folks feel about providing a new "text" parameter to replace the cryptic "universal_newlines=True" that would explicitly be equivalent to "universal newlines with sys.getdefaultencoding()"?

    If it's just text=True, I don't see the point of having two options with the same purpose.

    > b. Given (a), what if the new "text" parameter also accepted a new "subprocess.TextConfig" object in addition to the base behaviour of doing a plain bool(text) check to activate the defaults?

    Can you give an example? I don't see how the API would be used.

    zooba commented 8 years ago

    Given specifying an encoding will do the same thing as universal_newlines would have, should I just "hide" references to universal_newlines in the doc? (i.e. only mention it under a versionchanged banner, rather than front-and-centre)

    ncoghlan commented 8 years ago

    Ah, excellent point about "encoding='utf-8'" already being a less cryptic replacement for universal_newlines=True, so consider my questions withdrawn :)

    As far as universal_newlines goes, that needs to remain documented for the sake of folks reading code that uses it, and folks that need compatibility with older versions, but +1 for recommending specifying an encoding over using that flag for new 3.6+ code.

    vstinner commented 8 years ago

    > Given specifying an encoding will do the same thing as universal_newlines would have, should I just "hide" references to universal_newlines in the doc? (i.e. only mention it under a versionchanged banner, rather than front-and-centre)

    No. Don't hide universal_newlines; it's different from encoding. universal_newlines uses the locale encoding, which is a good choice in most cases.

    zooba commented 8 years ago

    > universal_newlines uses the locale encoding which is a good choice in most cases.

    Well, some cases, and only really by accident. I won't hide the parameter, but it is promoted as "frequently used" and I'll make sure to document encoding before universal_newlines. Call it a very weak deprecation.

    The new patch also removes the openers, which I think were solving a problem we don't have right now.

    zooba commented 8 years ago

    Addressed more feedback.

    vstinner commented 8 years ago

    6135_4.patch: Hmm, what if you only set errors? I suggest using TextIOWrapper but with the locale encoding:

        if (universal_newlines or errors) and not encoding:
            encoding = locale.getpreferredencoding(False)
    zooba commented 8 years ago

    Sure, that's easy enough. Any other concerns once I've made that change?

    vstinner commented 8 years ago

    > Sure, that's easy enough. Any other concerns once I've made that change?

    If errors enables Unicode and the doc is updated, the patch will LGTM :-)

    eryksun commented 8 years ago

    Why do you need to call getpreferredencoding()? Isn't that already the default if you call TextIOWrapper with encoding as None? For example:

        # Any of encoding, errors or universal_newlines enables text mode.
        text_mode = encoding or errors or universal_newlines

        self.stdin = io.open(p2cwrite, 'wb', bufsize)
        if text_mode:
            self.stdin = io.TextIOWrapper(self.stdin, write_through=True,
                                          line_buffering=(bufsize == 1),
                                          encoding=encoding, errors=errors)
    vstinner commented 8 years ago

    > Why do you need to call getpreferredencoding()?

    I proposed to do that, but I prefer your simple flag:

        text_mode = encoding or errors or universal_newlines

    Steve: please use this simpler flag to avoid TextIOWrapper details in subprocess.py ;-)
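    For the record, a short sketch of the behaviour as it eventually shipped in Python 3.6 (assuming a POSIX ls command): passing encoding or errors, like universal_newlines, switches the pipes to text mode.

        import subprocess

        # Python 3.6+: encoding/errors imply text-mode pipes.
        result = subprocess.run(["ls"], stdout=subprocess.PIPE, encoding="utf-8")
        print(result.stdout)  # str, decoded as UTF-8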