Open gpshead opened 6 years ago
I've tried reading various RFCs around Base64 encoding, but I couldn't make the ends meet. Yet there is an inconsistency between base64.decodebytes() and base64.decode() in that how they handle linebreaks that were used to collate the encoded text. Below is an example of what I'm talking about:
>>> import base64
>>> foo = base64.encodebytes(b'123456789')
>>> foo
b'MTIzNDU2Nzg5\n'
>>> foo = b'MTIzND\n' + b'U2Nzg5\n'
>>> foo
b'MTIzND\nU2Nzg5\n'
>>> base64.decodebytes(foo)
b'123456789'
>>> from io import BytesIO
>>> bytes_in = BytesIO(foo)
>>> bytes_out = BytesIO()
>>> bytes_in.seek(0)
0
>>> base64.decode(bytes_in, bytes_out)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/somewhere/lib/python3.6/base64.py", line 512, in decode
s = binascii.a2b_base64(line)
binascii.Error: Incorrect padding
>>> bytes_in = BytesIO(base64.encodebytes(b'123456789'))
>>> bytes_in.seek(0)
0
>>> base64.decode(bytes_in, bytes_out)
>>> bytes_out.getvalue()
b'123456789'
Obviously, I'd expect encodebytes() and encode both to either accept or to reject the same input.
Thanks.
Oleg
via Oleg Sivokon on python-dev (who was having trouble getting bugs.python.org account creation to work)
This reduces to the following:
>>> from binascii import a2b_base64 as f
>>> f(b'MTIzND\nU2Nzg5\n')
b'123456789'
>>> f(b'MTIzND\n')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
binascii.Error: Incorrect padding
That is, decode does its decoding line by line, whereas decodebytes passes the entire object to a2b_base64 as a single entity. Apparently a2b_base64 looks at the padding for the entirety of what it is given, which I believe is in accordance with the RFC. This means that decode is fundamentally broken per the RFC, and there is no obvious way to fix it without adding an incremental decoder to binascii. And an incremental decoder probably belongs in codecs (assuming we ever resolved the transcode interface issue, I can't actually remember...).
Note that it will work as long as an "integral" number of base64 encoding units are in each line.
I wrote an incremental base-64 decoder for the "codecs" module in bpo-27799, which you could use. It just does some preprocessing using a regular expression to pick four-character chunks before passing the data to a2b_base64. Or maybe implementing it properly in the "binascii" module is better.
Quickly reading RFC 2045, I saw it says "All line breaks or other characters not found in Table 1 [64 alphabet characters plus padding character] must be ignored by decoding software." So this is a real bug, although I think a base-64 encoder that triggers it would be rare.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = None closed_at = None created_at =
labels = ['3.7', 'type-bug', 'library']
title = 'base64.decode: linebreaks are not ignored'
updated_at =
user = 'https://github.com/gpshead'
```
bugs.python.org fields:
```python
activity =
actor = 'martin.panter'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'gregory.p.smith'
dependencies = []
files = []
hgrepos = []
issue_num = 32491
keywords = []
message_count = 3.0
messages = ['309449', '309451', '309454']
nosy_count = 3.0
nosy_names = ['gregory.p.smith', 'r.david.murray', 'martin.panter']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue32491'
versions = ['Python 3.6', 'Python 3.7']
```