python / cpython

The Python programming language
https://www.python.org

Use backslashreplace in pprint #63299

Open serhiy-storchaka opened 11 years ago

serhiy-storchaka commented 11 years ago
BPO 19100
Nosy @freddrake, @doerwalter, @pitrou, @vstinner, @ezio-melotti, @vadmium, @serhiy-storchaka
Files
  • pprint_unencodable.patch
  • pprint_unencodable_2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:
    ```python
    assignee = 'https://github.com/freddrake'
    closed_at = None
    created_at =
    labels = ['type-bug', 'library', 'expert-unicode']
    title = 'Use backslashreplace in pprint'
    updated_at =
    user = 'https://github.com/serhiy-storchaka'
    ```
    bugs.python.org fields:
    ```python
    activity =
    actor = 'serhiy.storchaka'
    assignee = 'fdrake'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)', 'Unicode']
    creation =
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['31881', '33084']
    hgrepos = []
    issue_num = 19100
    keywords = ['patch']
    message_count = 13.0
    messages = ['198465', '204952', '205846', '205902', '205907', '206178', '206200', '239650', '239692', '239742', '239756', '308600', '308618']
    nosy_count = 7.0
    nosy_names = ['fdrake', 'doerwalter', 'pitrou', 'vstinner', 'ezio.melotti', 'martin.panter', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue19100'
    versions = ['Python 3.3', 'Python 3.4']
    ```

    serhiy-storchaka commented 11 years ago

    Currently pprint.pprint() fails on unencodable characters.

    $ LANG=en_US.utf8 ./python -c "import pprint; pprint.pprint('\u20ac')"
    '€'
    $ LANG= ./python -c "import pprint; pprint.pprint('\u20ac')"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/serhiy/py/cpython/Lib/pprint.py", line 56, in pprint
        printer.pprint(object)
      File "/home/serhiy/py/cpython/Lib/pprint.py", line 137, in pprint
        self._format(object, self._stream, 0, 0, {}, 0)
      File "/home/serhiy/py/cpython/Lib/pprint.py", line 274, in _format
        write(rep)
    UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in position 1: ordinal not in range(128)

    This is a regression from Python 2, in which repr() always returns an ASCII string.

    $ LANG= python2.7 -c "import pprint; pprint.pprint(u'\u20ac')"
    u'\u20ac'

    Perhaps pprint() should use the backslashreplace error handler (as sys.displayhook() does). With the proposed patch:

    $ LANG= ./python -c "import pprint; pprint.pprint('\u20ac')"
    '\u20ac'
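
For illustration only, here is a minimal sketch of the idea, not the attached patch: format with pprint.pformat() and escape whatever the target stream's encoding cannot represent before writing. The helper name `pprint_escaped` is hypothetical, and the in-memory ASCII stream stands in for a terminal started with `LANG=`.

```python
import io
import pprint


def pprint_escaped(obj, stream):
    """Pretty-print obj, escaping characters the stream cannot encode.

    Hypothetical helper for illustration; not part of the pprint module.
    """
    text = pprint.pformat(obj) + "\n"
    # Assumption: fall back to UTF-8 when the stream reports no encoding.
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # Round-trip through the stream's encoding so that unencodable
    # characters become \xNN, \uNNNN or \UNNNNNNNN escapes.
    safe = text.encode(encoding, "backslashreplace").decode(encoding)
    stream.write(safe)


# Simulate a stream that only accepts ASCII.
raw = io.BytesIO()
ascii_stream = io.TextIOWrapper(raw, encoding="ascii", newline="\n")
pprint_escaped("\u20ac", ascii_stream)
ascii_stream.flush()
output = raw.getvalue()  # b"'\\u20ac'\n"
```

Escaping at the text level keeps the original stream object untouched, at the cost of encoding the text twice (once here, once inside the stream).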
    serhiy-storchaka commented 10 years ago

    Any review?

    serhiy-storchaka commented 10 years ago

    In the new patch, wrapping the stream is moved to the PrettyPrinter constructor.

    doerwalter commented 10 years ago

    This is not the fault of pprint. IMHO it doesn't make sense to fix anything here, at least not for pprint specifically. print() has the same "problem":

       $ LANG= ./python -c "print('\u20ac')"
       Traceback (most recent call last):
         File "<string>", line 1, in <module>
       UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in position 0: ordinal not in range(128)
    serhiy-storchaka commented 10 years ago

    pprint is not print.

    >>> print('\u20ac')
    €
    >>> import pprint; pprint.pprint('\u20ac')
    '€'

    Default sys.displayhook doesn't fail on unencodable output.

    $ LANG=C ./python
    Python 3.4.0b1 (default:e961a166dc70+, Dec 11 2013, 13:57:17) 
    [GCC 4.6.3] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> '\u20ac'
    '\u20ac'
    doerwalter commented 10 years ago

    sys.displayhook doesn't fail, because it uses the backslashreplace error handler, and for sys.displayhook that's OK, because it's only used for screen output and there some output is better than no output. However print and pprint.pprint might be used for output that is consumed by other programs (via pipes etc.) and IMHO in this case "Errors should never pass silently."
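
For reference, the fallback that sys.displayhook applies looks roughly like the sketch below. It is simplified from the pseudo-code in the sys module documentation: the function name `show` and the explicit `stream` parameter are illustrative, and the `builtins._` bookkeeping is left out.

```python
import io


def show(value, stream):
    """Write repr(value) to stream, retrying with backslashreplace."""
    if value is None:
        return
    text = repr(value)
    try:
        stream.write(text)
    except UnicodeEncodeError:
        data = text.encode(stream.encoding, "backslashreplace")
        if hasattr(stream, "buffer"):
            # Write bytes directly; the text layer would raise again.
            stream.buffer.write(data)
        else:
            # No binary layer: hand the escaped text back to write().
            stream.write(data.decode(stream.encoding, "strict"))
    stream.write("\n")


raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="ascii", newline="\n")
show("\u20ac", stream)
stream.flush()
shown = raw.getvalue()  # b"'\\u20ac'\n"
```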

    serhiy-storchaka commented 10 years ago

    The purpose of pprint.pprint() is to produce human-readable output. In this case some output is better than nothing. It isn't designed to be parseable by other programs, because sometimes it is even less accurate than the result of repr() (pprint() truncates long reprs and loses information for dict subclasses). Also, the result of pprint() can change from version to version (e.g. bpo-17150). The main source of non-ASCII characters is string reprs, and for them the backslashreplace error handler doesn't lose information. And pprint.pprint() is mainly used for screen output too.
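
The claim about string reprs can be checked with a small sketch: escaping a str repr with backslashreplace yields text that still evaluates back to the original string. The sample values are arbitrary, and this says nothing about objects with custom reprs.

```python
import ast

samples = ["\u20ac", "caf\xe9", "\U0001f600", "back\\slash \u20ac"]

results = []
for s in samples:
    # Escape everything outside ASCII in the repr of the string.
    escaped = repr(s).encode("ascii", "backslashreplace").decode("ascii")
    # The escaped repr is pure ASCII and is still a valid string literal
    # that denotes the original value.
    results.append(escaped.isascii() and ast.literal_eval(escaped) == s)

round_trips = all(results)
```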

    vadmium commented 9 years ago

    I agree with Serhiy that using a permissive error handler with pprint() is appropriate.

    What is the reasoning behind the DecodeWriter case, where the original stream has an interesting encoding, but “buffer” is None? Are there any real-world cases like that? Your mock test case sets encoding="latin1" with no buffer, but that class will also write non-latin1 strings, so there is no problem.

    Also I wonder if flushing the stream once or twice for each pprint() call is a wise move.

    Another way to tackle this might be a function that translates the non-Latin-1 or whatever characters, allowing the original write() or whatever method to still be used. Here is a Python 2 and 3 compatible attempt: https://bitbucket.org/Gfy/pyrescene/src/560cafe/rescene/utility.py#cl-426. Python 3 only version: https://github.com/vadmium/python-iview/commit/68b0559. This function is originally used for printing descriptive comments to stdout (alongside other text where the “strict” error handler is appropriate). But I think it could be generally usable for pprint(), sys.displayhook(), etc as well.

    doerwalter commented 9 years ago

    The linked code at https://github.com/vadmium/python-iview/commit/68b0559 seems strange to me:

    try:
        text.encode(encoding, textio.errors or "strict")
    except UnicodeEncodeError:
        text = text.encode(encoding, errors).decode(encoding)
    return text

    is the same as:

        return text.encode(encoding, errors).decode(encoding)

    because when there are no unencodable characters in text, the error handler will never be invoked.

    serhiy-storchaka commented 9 years ago

    > What is the reasoning behind the DecodeWriter case, where the original stream has an interesting encoding, but “buffer” is None? Are there any real-world cases like that?

    sys.stdout and sys.stderr in IDLE.

    vadmium commented 9 years ago

    Walter: the first line encoding with textio.errors is meant to handle the case where the output stream already has its own permissive error handler set. But anyway I was just trying to point out that it might be better to do the backslash escaping at the text level, and write the escaped text string to the original stream.
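
To make the difference between the two variants concrete, here is a small sketch. The function names and the `stream_errors` parameter are illustrative and are not the ones used in the linked commit.

```python
def escape_if_needed(text, encoding, stream_errors=None,
                     errors="backslashreplace"):
    """Variant with the initial check against the stream's own handler."""
    try:
        text.encode(encoding, stream_errors or "strict")
    except UnicodeEncodeError:
        text = text.encode(encoding, errors).decode(encoding)
    return text


def escape_always(text, encoding, errors="backslashreplace"):
    """Simplified variant: always round-trip through the encoding."""
    return text.encode(encoding, errors).decode(encoding)


# With a strict stream the two variants agree.
strict_a = escape_if_needed("\u20ac", "ascii")
strict_b = escape_always("\u20ac", "ascii")

# With a stream that already uses a permissive handler such as "replace",
# the first variant leaves the text alone and lets the stream handle it.
permissive_a = escape_if_needed("\u20ac", "ascii", stream_errors="replace")
permissive_b = escape_always("\u20ac", "ascii")
```

So the two are equivalent only when the stream's handler is strict (or unset); otherwise the first variant defers to the stream.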

    Serhiy: thanks for pointing out IDLE’s stdout. It seems the encoding can be set to say ASCII by the locale, yet it still accepts non-ASCII text. But I guess that’s a separate issue.

    I haven’t tested the patch, but reading it, I think there may be a couple of problems:

    bpo-15216 is slightly related, and has a patch apparently allowing the encoding and error handler to be changed on a text stream. But I guess it is no good here because you need backwards compatibility with other non-TextIOWrapper streams.

    vstinner commented 6 years ago
    $ LANG= ./python -c "import pprint; pprint.pprint('\u20ac')"

    Thanks to PEP 538 and PEP 540, this command now works as expected in Python 3.7:

    vstinner@apu$ LANG= python3.7 -c "import pprint; pprint.pprint('\u20ac')"
    '€'

    Do we still need the pprint_unencodable_2.patch workaround?

    serhiy-storchaka commented 6 years ago

    Try with LANG=en_US.

    And even UTF-8 can fail.
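
One way UTF-8 output can fail, which may or may not be the case meant here, is a repr that contains a lone surrogate. str reprs escape surrogates themselves, so the sketch uses a hypothetical class whose `__repr__` returns one directly.

```python
import io
import pprint


class Surrogate:
    """Hypothetical object whose repr contains a lone surrogate."""

    def __repr__(self):
        return "<Surrogate \udc80>"


# In-memory stand-in for a UTF-8 stdout with the default strict handler.
stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
try:
    pprint.pprint(Surrogate(), stream=stream)
except UnicodeEncodeError as exc:
    failure = exc
else:
    failure = None
```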

    devdanzin commented 5 months ago

    This sounds like a good safeguard against encoding errors in theory, but the problematic examples don't trigger issues for me. Is there still a need for this?

    vstinner commented 5 months ago

    Ah right, I can reproduce on Python 3.14 with a locale using Latin1 as the locale encoding:

    $ LANG=en_US ./python -c "import pprint; pprint.pprint('\u20ac')"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
        import pprint; pprint.pprint('\u20ac')
                       ~~~~~~~~~~~~~^^^^^^^^^^
      File "/home/vstinner/python/main/Lib/pprint.py", line 55, in pprint
        printer.pprint(object)
        ~~~~~~~~~~~~~~^^^^^^^^
      File "/home/vstinner/python/main/Lib/pprint.py", line 156, in pprint
        self._format(object, self._stream, 0, 0, {}, 0)
        ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/vstinner/python/main/Lib/pprint.py", line 197, in _format
        stream.write(rep)
        ~~~~~~~~~~~~^^^^^
    UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 1: ordinal not in range(256)
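
The same failure can be reproduced without changing the locale, as a sketch that assumes an in-memory Latin-1 text stream behaves like stdout under `LANG=en_US`:

```python
import io
import pprint

# Stand-in for stdout with a Latin-1 locale encoding and strict errors.
latin1_stream = io.TextIOWrapper(io.BytesIO(), encoding="latin-1")
try:
    pprint.pprint("\u20ac", stream=latin1_stream)
except UnicodeEncodeError as exc:
    error = exc
else:
    error = None
```

Here `error.start` is 1 because the unencodable character follows the opening quote of the repr, matching "position 1" in the traceback above.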
    devdanzin commented 5 months ago

    Can confirm the issue and that @serhiy-storchaka's code works as-is. Should I submit a PR so he can take ownership?