python / cpython

The Python programming language
https://www.python.org
Other
63.2k stars 30.26k forks source link

reject unicode in zlib #49007

Closed vstinner closed 14 years ago

vstinner commented 15 years ago
BPO 4757
Nosy @malemburg, @pitrou, @vstinner, @florentx
Files
  • zlib_bytes.patch
  • issue4757_zlib_bytes_v2.diff: Patch, apply to py3k
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/pitrou' closed_at = created_at = labels = ['extension-modules', 'type-bug'] title = 'reject unicode in zlib' updated_at = user = 'https://github.com/vstinner' ``` bugs.python.org fields: ```python activity = actor = 'vstinner' assignee = 'pitrou' closed = True closed_date = closer = 'pitrou' components = ['Extension Modules'] creation = creator = 'vstinner' dependencies = [] files = ['12472', '15556'] hgrepos = [] issue_num = 4757 keywords = ['patch'] message_count = 15.0 messages = ['78356', '78357', '78360', '78365', '79090', '79110', '79124', '79225', '96376', '96378', '96382', '96383', '96384', '96389', '97491'] nosy_count = 5.0 nosy_names = ['lemburg', 'pitrou', 'vstinner', 'ebfe', 'flox'] pr_nums = [] priority = 'normal' resolution = 'fixed' stage = None status = 'closed' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue4757' versions = ['Python 3.1', 'Python 3.2'] ```

    vstinner commented 15 years ago

    Python 2.x allows to encode any byte string (str) and ASCII unicode string (unicode):

    $ python
    Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
    >>> import zlib
    >>> zlib.compress('abc')
    "x\x9cKLJ\x06\x00\x02M\x01'"
    >>> zlib.compress(u'abc')
    "x\x9cKLJ\x06\x00\x02M\x01'"
    >>> zlib.compress(u'abc\xe9')
    ...
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' ...

    I'm not sure that this behaviour was really wanted become the decompress operation is not symetric (the result type is always byte string):

    $ python
    Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
    >>> import zlib
    >>> zlib.decompress("x\x9cKLJ\x06\x00\x02M\x01'")
    'abc'

    Python 3.0 accepts any string: bytes or characters. But decompress always produce bytes string:

    $ ./python
    Python 3.1a0 (py3k:67926M, Dec 26 2008, 23:59:07)
    >>> import zlib
    >>> zlib.compress(b'abc')
    b"x\x9cKLJ\x06\x00\x02M\x01'"
    >>> zlib.compress('abc')
    b"x\x9cKLJ\x06\x00\x02M\x01'"
    >>> zlib.compress('abc\xe9')
    b'x\x9cKLJ>\xbc\x12\x00\x06\xca\x02\x93'
    >>> zlib.compress('abc\xe9'.encode('utf-8'))
    b'x\x9cKLJ>\xbc\x12\x00\x06\xca\x02\x93'
    >>> zlib.decompress(b'x\x9cKLJ>\xbc\x12\x00\x06\xca\x02\x93')
    b'abc\xc3\xa9'

    The most strange operation is the decompression of an unicode string:

    $ ./python
    >>> zlib.decompress('x\x9cKLJ>\xbc\x12\x00\x06\xca\x02\x93')
    ...
    zlib.error: Error -3 while decompressing data: incorrect header check

    I propose to change zlib API to reject unicode string and use explicit conversion to/from bytes. Functions/methods:

    Note: binascii.crc32() already rejects unicode string.

    The behaviour may kept in Python 3.0.x and only changed in Python 3.1.

    vstinner commented 15 years ago

    See also issue bpo-4738 (better threads support in zlib).

    malemburg commented 15 years ago
    On 2008-12-27 13:58, STINNER Victor wrote:
    > Python 2.x allows to encode any byte string (str) and ASCII unicode 
    > string (unicode):
    > 
    > $ python
    > Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
    >>>> import zlib
    >>>> zlib.compress('abc')
    > "x\x9cKLJ\x06\x00\x02M\x01'"
    >>>> zlib.compress(u'abc')
    > "x\x9cKLJ\x06\x00\x02M\x01'"
    >>>> zlib.compress(u'abc\xe9')
    > ...
    > UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' ...
    > 
    > I'm not sure that this behaviour was really wanted become the 
    > decompress operation is not symetric (the result type is always byte 
    > string):
    > 
    > $ python
    > Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
    >>>> import zlib
    >>>> zlib.decompress("x\x9cKLJ\x06\x00\x02M\x01'")
    > 'abc'
    > 

    I don't see a problem with this. The fact that Python 2.x also accepts Unicode ASCII strings where strings are normally expected is intended to help with the migration to Unicode, so the above is expected.

    zlib itself doesn't care about whether the data to be encoded is text or bytes.

    In Python 3.x, it's probably better to use bytes throughout the API.

    4f9a157d-9f66-4b5a-8130-05a615b2bc3f commented 15 years ago

    I don't think Python 2.x should be changed - but 3.0 or 3.1 should be:

    vstinner commented 15 years ago

    The fact that Python 2.x also accepts Unicode ASCII strings where strings are normally expected is intended to help with the migration to Unicode

    I hate this behaviour. It doesn't help migration, it's the opposite! Sometimes it works (ASCII), and somtimes it fails (just one non-ASCII character). And then we will read "Unicode sucks!" because people doesn't understand the error.

    In Python 3.x, it's probably better to use bytes throughout the API.

    I propose to reject unicode in Python 3.x and display a warning for Python 2.x. A warning to prepare the migration... not to Unicode, but to Python3 ;-)

    4f9a157d-9f66-4b5a-8130-05a615b2bc3f commented 15 years ago

    The current behaviour may help the majority by ignorance and cause weird errors for others. We tell people that Python distincts between Text and Data but actually treat it all the same by implicit encoding.

    Modules that only operate on Bytes should reject Unicode-objects in Python3; it's a matter of 3 lines to display a warning in Python 2. Those modules that usually operate on Text but have single functions that operate on Bytes should display a warning but not enforce explicit encoding.

    Also see bpo-4821 and bpo-4818 where unicode already got rejected by the openssl-driven classes but silently accepted by the build-in ones.

    malemburg commented 15 years ago

    On 2009-01-04 23:51, STINNER Victor wrote:

    STINNER Victor \victor.stinner@haypocalc.com\ added the comment:

    > The fact that Python 2.x also accepts Unicode ASCII strings > where strings are normally expected is intended to help with > the migration to Unicode

    I hate this behaviour. It doesn't help migration, it's the opposite! Sometimes it works (ASCII), and somtimes it fails (just one non-ASCII character). And then we will read "Unicode sucks!" because people doesn't understand the error.

    Well, that's your opinion.

    The feature was added to get people work with Unicode at all, since otherwise we would have had to do all the Unicode porting we're doing now for Python 3 at the time Unicode was introduced - which was in version Python 1.6, eight years ago.

    At the time the Python community was a lot smaller and there wasn't all that much interest in Unicode anyway - the Unicode support I wrote for Python 1.6 was partially financed by HP which needed it for an application they had written in Python.

    See the introduction in PEP-100 for the motivation behind the design decisions:

    http://www.python.org/dev/peps/pep-0100/

    > In Python 3.x, it's probably better to use bytes throughout the > API.

    I propose to reject unicode in Python 3.x and display a warning for Python 2.x. A warning to prepare the migration... not to Unicode, but to Python3 ;-)

    Fair enough.

    vstinner commented 15 years ago

    > I propose to reject unicode in Python 3.x and display a warning for > Python 2.x. A warning to prepare the migration... not to Unicode, but to > Python3 ;-)

    Fair enough.

    The patch for Python 3.x is already attached to this issue. We might only apply this one and leave Python 2.x unchanged. Can someone review the patch?

    eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago

    Definitely, zlib.compress should raise a TypeError (like bz2 does).

    >>> import bz2, zlib
    >>> bz2.compress('abc')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: argument 1 must be bytes or buffer, not str
    >>> zlib.compress('abc')
    b"x\x9cKLJ\x06\x00\x02M\x01'"

    Someone can review the patch and merge it?

    pitrou commented 14 years ago

    The patch lacks a test that TypeError is raised on unicode input, otherwise it's fine.

    eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago

    Patch from haypo updated for r76830 .

    Additional tests for TypeError, and to check that bytearray objects are accepted.

    pitrou commented 14 years ago

    The patch produces a number of errors in test_tarfile, test_distutils, test_gzip and test_xmlrpc.

    eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago

    Fixed.

    And some "bytearray" tests improved in test_zlib.

    pitrou commented 14 years ago

    The patch was committed to py3k and 3.1. Thank you!

    vstinner commented 14 years ago

    The patch was committed to py3k and 3.1. Thank you!

    r76836 and r76838