Closed 1efe45e0-3fff-4905-80a3-b3581a2ab71f closed 14 years ago
See bpo-4869 for a related bug.
Most of the functions in binascii are meant to go from binary data to textual representations (hex digits, base64, binhex, etc.). There are numerous problems:
>>> binascii.b2a_base64("Some text")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: b2a_base64() argument 1 must be string or buffer, not str
>>> binascii.crc32("Some text")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: crc32() argument 1 must be string or buffer, not str
>>>
Huh? Didn't I just pass a string? Error message should say "argument 1 must be bytes or buffer, not str".
This problem shows up with most of the other encoding functions too (i.e., b2a_uu).
The functions in this module are going from binary to text. To be consistent, I think the result of encoding operations such as b2a_uu(), b2a_base64(), should be strings, not bytes.
Similarly, decoding operations are going from text back to bytes. I think the input arguments should be allowed to be str (in addition to bytes if you want).
Item 1 was most probably fixed recently with r67929. Concerning item 2, I think it was decided that binascii is a bytes-only module.
I suggest to close this issue as "out of date".
Given the low-level nature of this module, I can understand the motivation to make it all bytes.
However, I'm going to respectfully disagree with that and claim that making binascii all bytes really goes against the whole spirit of what Python 3.0 has tried to do for Unicode. For example, throughout Python, you now have a clean separation between binary data (bytes) and text data (str). Well, it's cleanly separated everywhere except in the binascii module (and base64 module) which, ironically, is all about converting between binary data and text.
As it stands now, it's a huge wart IMHO.
See also issue bpo-4769 (want base64.b64decode(str)).
I agree, we need consistency between all functions of this package.
I've run a small script to check what happens for all 16 functions of the binascii package when they receive unicode input...
See attached script (and sample output).
IMHO 4 functions should be fixed to raise the same TypeError:
Documentation says that the functions in binary
module « convert
between binary and various ASCII-encoded binary representations »
So... it's all binary.
Implicit encoding should not happen, never.
This patch removes implicit encoding in binascii functions:
a2b_hex (=unhexlify)
a2b_qp
rledecode_hqx
Tests module "test_binascii" is reviewed and simplified.
Fixes for "email", "pickle" and "quopri" modules to encode input.
All tests pass.
flox wrote:
flox \laxyf@yahoo.fr\ added the comment:
This patch removes implicit encoding in binascii functions:
a2b_hex (=unhexlify)
a2b_qp
rledecode_hqx
Tests module "test_binascii" is reviewed and simplified.
Fixes for "email", "pickle" and "quopri" modules to encode input.
All tests pass.
Are you sure that this patch is correct (which RFC says that quoted printable should use our raw-unicode-escape codec ?):
Index: Lib/email/message.py \===================================================================
--- Lib/email/message.py (revision 76839)
+++ Lib/email/message.py (working copy)
@@ -198,6 +198,8 @@
return None
cte = self.get('content-transfer-encoding', '').lower()
if cte == 'quoted-printable':
+ if isinstance(payload, str):
+ payload = payload.encode('raw-unicode-escape')
return utils._qdecode(payload)
elif cte == 'base64':
try:
The patch also needs to fix the documentation and add a NEWS entry.
Making these APIs strict in the sense that they don't accept str-instances anymore needs to be mentioned very clearly.
We may even have to go through a deprecation process for them, since they can easily cause perfectly working code to suddenly fail.
Are you sure that this patch is correct (which RFC says that quoted printable should use our raw-unicode-escape codec ?):
I am not sure of anything. It is an "educated guess" at the most. Since 'base64' and 'x-uuencode' both use 'raw-unicode-escape'...
See longer excerpt below.
Index: Lib/email/message.py \===================================================================
--- Lib/email/message.py (revision 76839)
+++ Lib/email/message.py (working copy)
@@ -189,24 +189,26 @@
elif not isinstance(self._payload, list):
raise TypeError('Expected list, got %s' % type(self._payload))
else:
payload = self._payload[i]
if not decode:
return payload
# Decoded payloads always return bytes. XXX split this part
out into
# a new method called .get_decoded_payload().
if self.is_multipart():
return None
cte = self.get('content-transfer-encoding', '').lower()
if cte == 'quoted-printable':
+ if isinstance(payload, str):
+ payload = payload.encode('raw-unicode-escape')
return utils._qdecode(payload)
elif cte == 'base64':
try:
if isinstance(payload, str):
payload = payload.encode('raw-unicode-escape')
return base64.b64decode(payload)
#return utils._bdecode(payload)
except binascii.Error:
# Incorrect padding
pass
elif cte in ('x-uuencode', 'uuencode', 'uue', 'x-uue'):
in_file = BytesIO(payload.encode('raw-unicode-escape'))
flox wrote:
flox \laxyf@yahoo.fr\ added the comment:
> Are you sure that this patch is correct (which RFC says > that quoted printable should use our raw-unicode-escape > codec ?):
I am not sure of anything. It is an "educated guess" at the most. Since 'base64' and 'x-uuencode' both use 'raw-unicode-escape'...
Quoted printable as well as the other two transfer encodings should be encodings that "fit" into the 7-bit ASCII default originally assumed for email messages, so 'ascii' appears to be the more natural choice.
The choice of 'raw-unicode-escape' will cause strange error messages or hide errors, since it encodes non-ASCII code points using '\xNN' which these codecs don't supports:
b''
>>> base64.b64decode('äöü'.encode('ascii'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
I agree with Marc-André that enforcing ASCII looks like the only sensible solution. Perhaps Barry knows the purpose of using "raw-unicode-escape" here, though?
The email interface ate part of the code snippet. Here's the complete version:
>>> import base64
>>> base64.b64decode('äöü'.encode('raw-unicode-escape'))
b''
>>> base64.b64decode('äöü'.encode('ascii'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-2: ordinal not in range(128)
I perform a "grep" on the email package (with patch applied). There's some places where 'raw-unicode-escape' is used.
I understand that all "payload.encode('raw-unicode-escape')" should be changed to "payload.encode('ascii')" when the payload is processed by another transfer encoding (line 202, 207 and 214). The last one (when 'content-transfer-encoding' is not recognized) should remain unchanged?
All other uses of 'raw-unicode-escape' seem correct.
Lib/email/base64mime.py:112: return a2b_base64(string.encode('raw-unicode-escape')) Lib/email/header.py:111: word = bytes(word, 'raw-unicode-escape') Lib/email/message.py:202: payload = payload.encode('raw-unicode-escape') Lib/email/message.py:207: payload = payload.encode('raw-unicode-escape') Lib/email/message.py:214: in_file = BytesIO(payload.encode('raw-unicode-escape')) Lib/email/message.py:225: return payload.encode('raw-unicode-escape') Lib/email/message.py:767: as_bytes = charset[2].encode('raw-unicode-escape') Lib/email/utils.py:298: rawbytes = bytes(text, 'raw-unicode-escape')
Patch updated:
Note: there was no ambiguity on the first line of the documentation: « [These functions] convert between binary and various ASCII-encoded binary representations »
So, it should only convert bytes in range(256) to bytes in range(128).
Patch updated, after 7703 is merged in py3k.
Please also add a NEWS entry which mentions the changes to those APIs and the change in the email package.
Thanks.
Florent, could you add a NEWS entry as requested in msg98381, then we should be able to get this committed.
Fixed with r83182.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = 'https://github.com/florentx' closed_at =
created_at =
labels = ['type-bug', 'library']
title = 'binascii module, inconsistent behavior: some functions accept unicode string input'
updated_at =
user = 'https://bugs.python.org/beazley'
```
bugs.python.org fields:
```python
activity =
actor = 'flox'
assignee = 'flox'
closed = True
closed_date =
closer = 'flox'
components = ['Library (Lib)']
creation =
creator = 'beazley'
dependencies = []
files = ['15559', '15914']
hgrepos = []
issue_num = 4770
keywords = ['patch']
message_count = 17.0
messages = ['78470', '78473', '78475', '78581', '96394', '96399', '96401', '96402', '96403', '96404', '96407', '96414', '96418', '97893', '98381', '110836', '111733']
nosy_count = 9.0
nosy_names = ['lemburg', 'barry', 'beazley', 'amaury.forgeotdarc', 'pitrou', 'vstinner', 'eric.araujo', 'flox', 'BreamoreBoy']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue4770'
versions = ['Python 3.2']
```