binascii module, inconsistent behavior: some functions accept unicode string input

1efe45e0-3fff-4905-80a3-b3581a2ab71f commented 15 years ago

BPO	4770
Nosy	@malemburg, @warsaw, @amauryfa, @pitrou, @vstinner, @merwok, @florentx
Files	case_binascii.py: Test case and output on Python 3.2 issue4770_binascii_py3k_v3.diff: Patch with documentation, apply to py3k

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = 'https://github.com/florentx' closed_at = created_at = labels = ['type-bug', 'library'] title = 'binascii module, inconsistent behavior: some functions accept unicode string input' updated_at = user = 'https://bugs.python.org/beazley' ``` bugs.python.org fields: ```python activity = actor = 'flox' assignee = 'flox' closed = True closed_date = closer = 'flox' components = ['Library (Lib)'] creation = creator = 'beazley' dependencies = [] files = ['15559', '15914'] hgrepos = [] issue_num = 4770 keywords = ['patch'] message_count = 17.0 messages = ['78470', '78473', '78475', '78581', '96394', '96399', '96401', '96402', '96403', '96404', '96407', '96414', '96418', '97893', '98381', '110836', '111733'] nosy_count = 9.0 nosy_names = ['lemburg', 'barry', 'beazley', 'amaury.forgeotdarc', 'pitrou', 'vstinner', 'eric.araujo', 'flox', 'BreamoreBoy'] pr_nums = [] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue4770' versions = ['Python 3.2'] ```

1efe45e0-3fff-4905-80a3-b3581a2ab71f commented 15 years ago

See bpo-4869 for a related bug.

Most of the functions in binascii are meant to go from binary data to textual representations (hex digits, base64, binhex, etc.). There are numerous problems:

Misleading error messages. For example:

>>> binascii.b2a_base64("Some text")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: b2a_base64() argument 1 must be string or buffer, not str
>>> binascii.crc32("Some text")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: crc32() argument 1 must be string or buffer, not str
>>>

Huh? Didn't I just pass a string? Error message should say "argument 1 must be bytes or buffer, not str".

This problem shows up with most of the other encoding functions too (i.e., b2a_uu).

Expected behavior with encoding/decoding.

The functions in this module are going from binary to text. To be consistent, I think the result of encoding operations such as b2a_uu(), b2a_base64(), should be strings, not bytes.

Similarly, decoding operations are going from text back to bytes. I think the input arguments should be allowed to be str (in addition to bytes if you want).

amauryfa commented 15 years ago

Item 1 was most probably fixed recently with r67929. Concerning item 2, I think it was decided that binascii is a bytes-only module.

I suggest to close this issue as "out of date".

1efe45e0-3fff-4905-80a3-b3581a2ab71f commented 15 years ago

Given the low-level nature of this module, I can understand the motivation to make it all bytes.

However, I'm going to respectfully disagree with that and claim that making binascii all bytes really goes against the whole spirit of what Python 3.0 has tried to do for Unicode. For example, throughout Python, you now have a clean separation between binary data (bytes) and text data (str). Well, it's cleanly separated everywhere except in the binascii module (and base64 module) which, ironically, is all about converting between binary data and text.

As it stands now, it's a huge wart IMHO.

vstinner commented 15 years ago

See also issue bpo-4769 (want base64.b64decode(str)).

eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago

I agree, we need consistency between all functions of this package.

I've run a small script to check what happens for all 16 functions of the binascii package when they receive unicode input...

See attached script (and sample output).

IMHO 4 functions should be fixed to raise the same TypeError:

a2b_hex
a2b_qp
unhexlify
rledecode_hqx

Documentation says that the functions in binary module « convert between binary and various ASCII-encoded binary representations » So... it's all binary.

Implicit encoding should not happen, never.

eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago

This patch removes implicit encoding in binascii functions:

a2b_hex (=unhexlify)
a2b_qp
rledecode_hqx
Tests module "test_binascii" is reviewed and simplified.
Fixes for "email", "pickle" and "quopri" modules to encode input.

All tests pass.

malemburg commented 14 years ago

flox wrote:

flox \laxyf@yahoo.fr\ added the comment:

This patch removes implicit encoding in binascii functions:

a2b_hex (=unhexlify)

a2b_qp

rledecode_hqx

Tests module "test_binascii" is reviewed and simplified.

Fixes for "email", "pickle" and "quopri" modules to encode input.

All tests pass.

Are you sure that this patch is correct (which RFC says that quoted printable should use our raw-unicode-escape codec ?):

Index: Lib/email/message.py \===================================================================

--- Lib/email/message.py    (revision 76839)
+++ Lib/email/message.py    (working copy)
@@ -198,6 +198,8 @@
             return None
         cte = self.get('content-transfer-encoding', '').lower()
         if cte == 'quoted-printable':
+            if isinstance(payload, str):
+                payload = payload.encode('raw-unicode-escape')
             return utils._qdecode(payload)
         elif cte == 'base64':
             try:

The patch also needs to fix the documentation and add a NEWS entry.

Making these APIs strict in the sense that they don't accept str-instances anymore needs to be mentioned very clearly.

We may even have to go through a deprecation process for them, since they can easily cause perfectly working code to suddenly fail.

eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago

Are you sure that this patch is correct (which RFC says that quoted printable should use our raw-unicode-escape codec ?):

I am not sure of anything. It is an "educated guess" at the most. Since 'base64' and 'x-uuencode' both use 'raw-unicode-escape'...

See longer excerpt below.

Index: Lib/email/message.py \===================================================================

--- Lib/email/message.py        (revision 76839)
+++ Lib/email/message.py        (working copy)
@@ -189,24 +189,26 @@
         elif not isinstance(self._payload, list):
             raise TypeError('Expected list, got %s' % type(self._payload))
         else:
             payload = self._payload[i]
         if not decode:
             return payload
         # Decoded payloads always return bytes.  XXX split this part
out into
         # a new method called .get_decoded_payload().
         if self.is_multipart():
             return None
         cte = self.get('content-transfer-encoding', '').lower()
         if cte == 'quoted-printable':
+            if isinstance(payload, str):
+                payload = payload.encode('raw-unicode-escape')
             return utils._qdecode(payload)
         elif cte == 'base64':
             try:
                 if isinstance(payload, str):
                     payload = payload.encode('raw-unicode-escape')
                 return base64.b64decode(payload)
                 #return utils._bdecode(payload)
             except binascii.Error:
                 # Incorrect padding
                 pass
         elif cte in ('x-uuencode', 'uuencode', 'uue', 'x-uue'):
             in_file = BytesIO(payload.encode('raw-unicode-escape'))

malemburg commented 14 years ago

flox wrote:

flox \laxyf@yahoo.fr\ added the comment:

> Are you sure that this patch is correct (which RFC says > that quoted printable should use our raw-unicode-escape > codec ?):

I am not sure of anything. It is an "educated guess" at the most. Since 'base64' and 'x-uuencode' both use 'raw-unicode-escape'...

Quoted printable as well as the other two transfer encodings should be encodings that "fit" into the 7-bit ASCII default originally assumed for email messages, so 'ascii' appears to be the more natural choice.

The choice of 'raw-unicode-escape' will cause strange error messages or hide errors, since it encodes non-ASCII code points using '\xNN' which these codecs don't supports:

b''
>>> base64.b64decode('äöü'.encode('ascii'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

pitrou commented 14 years ago

I agree with Marc-André that enforcing ASCII looks like the only sensible solution. Perhaps Barry knows the purpose of using "raw-unicode-escape" here, though?

malemburg commented 14 years ago

The email interface ate part of the code snippet. Here's the complete version:

>>> import base64
>>> base64.b64decode('äöü'.encode('raw-unicode-escape'))
b''
>>> base64.b64decode('äöü'.encode('ascii'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-2: ordinal not in range(128)

eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago

I perform a "grep" on the email package (with patch applied). There's some places where 'raw-unicode-escape' is used.

I understand that all "payload.encode('raw-unicode-escape')" should be changed to "payload.encode('ascii')" when the payload is processed by another transfer encoding (line 202, 207 and 214). The last one (when 'content-transfer-encoding' is not recognized) should remain unchanged?

All other uses of 'raw-unicode-escape' seem correct.

Lib/email/base64mime.py:112: return a2b_base64(string.encode('raw-unicode-escape')) Lib/email/header.py:111: word = bytes(word, 'raw-unicode-escape') Lib/email/message.py:202: payload = payload.encode('raw-unicode-escape') Lib/email/message.py:207: payload = payload.encode('raw-unicode-escape') Lib/email/message.py:214: in_file = BytesIO(payload.encode('raw-unicode-escape')) Lib/email/message.py:225: return payload.encode('raw-unicode-escape') Lib/email/message.py:767: as_bytes = charset[2].encode('raw-unicode-escape') Lib/email/utils.py:298: rawbytes = bytes(text, 'raw-unicode-escape')

eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago

Patch updated:

Added documentation
Switched to "payload.encode('ascii')" in the "email.message" module

Note: there was no ambiguity on the first line of the documentation: « [These functions] convert between binary and various ASCII-encoded binary representations »

So, it should only convert bytes in range(256) to bytes in range(128).

eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago

Patch updated, after 7703 is merged in py3k.

malemburg commented 14 years ago

Please also add a NEWS entry which mentions the changes to those APIs and the change in the email package.

Thanks.

83d2e70e-e599-4a04-b820-3814bbdb9bef commented 14 years ago

Florent, could you add a NEWS entry as requested in msg98381, then we should be able to get this committed.

eda57068-96ad-4b33-8431-9c528f59a6a6 commented 14 years ago

Fixed with r83182.

python / cpython

binascii module, inconsistent behavior: some functions accept unicode string input #49020