python / cpython

base64.decode: linebreaks are not ignored #76672

Open gpshead opened 6 years ago

gpshead commented 6 years ago
BPO 32491
Nosy @gpshead, @bitdancer, @vadmium

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at = None
created_at =
labels = ['3.7', 'type-bug', 'library']
title = 'base64.decode: linebreaks are not ignored'
updated_at =
user = 'https://github.com/gpshead'
```

bugs.python.org fields:

```python
activity =
actor = 'martin.panter'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'gregory.p.smith'
dependencies = []
files = []
hgrepos = []
issue_num = 32491
keywords = []
message_count = 3.0
messages = ['309449', '309451', '309454']
nosy_count = 3.0
nosy_names = ['gregory.p.smith', 'r.david.murray', 'martin.panter']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue32491'
versions = ['Python 3.6', 'Python 3.7']
```

gpshead commented 6 years ago

I've tried reading various RFCs around Base64 encoding, but I couldn't piece together a clear answer. Still, base64.decodebytes() and base64.decode() are inconsistent in how they handle the line breaks used to wrap the encoded text. Below is an example of what I'm talking about:

>>> import base64
>>> foo = base64.encodebytes(b'123456789')
>>> foo
b'MTIzNDU2Nzg5\n'
>>> foo = b'MTIzND\n' + b'U2Nzg5\n'
>>> foo
b'MTIzND\nU2Nzg5\n'
>>> base64.decodebytes(foo)
b'123456789'
>>> from io import BytesIO
>>> bytes_in = BytesIO(foo)
>>> bytes_out = BytesIO()
>>> bytes_in.seek(0)
0
>>> base64.decode(bytes_in, bytes_out)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/somewhere/lib/python3.6/base64.py", line 512, in decode
    s = binascii.a2b_base64(line)
binascii.Error: Incorrect padding
>>> bytes_in = BytesIO(base64.encodebytes(b'123456789'))
>>> bytes_in.seek(0)
0
>>> base64.decode(bytes_in, bytes_out)
>>> bytes_out.getvalue()
b'123456789'

Obviously, I'd expect decodebytes() and decode() to either both accept or both reject the same input.

Thanks.

Oleg

via Oleg Sivokon on python-dev (who was having trouble getting bugs.python.org account creation to work)

bitdancer commented 6 years ago

This reduces to the following:

>>> from binascii import a2b_base64 as f
>>> f(b'MTIzND\nU2Nzg5\n')
b'123456789'
>>> f(b'MTIzND\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
binascii.Error: Incorrect padding

That is, decode does its decoding line by line, whereas decodebytes passes the entire object to a2b_base64 as a single entity. Apparently a2b_base64 looks at the padding for the entirety of what it is given, which I believe is in accordance with the RFC. This means that decode is fundamentally broken per the RFC, and there is no obvious way to fix it without adding an incremental decoder to binascii. And an incremental decoder probably belongs in codecs (assuming we ever resolved the transcode interface issue, I can't actually remember...).

Note that decode will work as long as each line contains a whole number of base64 encoding units, that is, a multiple of four encoded characters.
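
To make the line-by-line behaviour concrete, here is a minimal sketch. The helper `decode_line_by_line` is a hypothetical name that only approximates what base64.decode does internally (one a2b_base64 call per line); it is not the library's code.

```python
import binascii


def decode_line_by_line(data: bytes) -> bytes:
    """Approximate base64.decode(): hand each line to a2b_base64 separately."""
    return b"".join(
        binascii.a2b_base64(line) for line in data.splitlines(keepends=True)
    )


# Each line holds a multiple of four encoded characters (8 and 4).
aligned = b"MTIzNDU2\nNzg5\n"
# The first line is cut after 6 characters, in the middle of a unit.
unaligned = b"MTIzND\nU2Nzg5\n"

print(decode_line_by_line(aligned))      # b'123456789'

try:
    decode_line_by_line(unaligned)
except binascii.Error as exc:
    print("line-by-line decoding failed:", exc)

# Passing the whole buffer at once succeeds, as decodebytes() does.
print(binascii.a2b_base64(unaligned))    # b'123456789'
```

The same encoded payload decodes or fails depending only on where the line breaks fall.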

vadmium commented 6 years ago

I wrote an incremental base-64 decoder for the "codecs" module in bpo-27799, which you could use. It just does some preprocessing using a regular expression to pick four-character chunks before passing the data to a2b_base64. Or maybe implementing it properly in the "binascii" module is better.
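
For illustration only, the general idea of such preprocessing might look like the sketch below. This is not the bpo-27799 code: `decode_stream` and `_NON_ALPHABET` are hypothetical names, and the sketch assumes padding appears only at the very end of the input.

```python
import binascii
import re
from io import BytesIO

# Matches every byte outside the base64 alphabet and the padding character.
_NON_ALPHABET = re.compile(rb"[^A-Za-z0-9+/=]")


def decode_stream(src, dst):
    """Decode base64 from src to dst, tolerating arbitrary line wrapping."""
    pending = b""
    for line in src:
        # Drop line breaks and other non-alphabet bytes, then buffer the rest.
        pending += _NON_ALPHABET.sub(b"", line)
        # Only decode complete four-character units; carry the remainder over.
        usable = len(pending) - len(pending) % 4
        if usable:
            dst.write(binascii.a2b_base64(pending[:usable]))
            pending = pending[usable:]
    if pending:
        # A leftover partial unit means truncated input; a2b_base64 raises here.
        dst.write(binascii.a2b_base64(pending))


src = BytesIO(b"MTIzND\nU2Nzg5\n")
dst = BytesIO()
decode_stream(src, dst)
print(dst.getvalue())  # b'123456789'
```

Carrying the remainder between lines is what lets the decoder accept input that is split in the middle of a unit.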

Quickly reading RFC 2045, I saw it says "All line breaks or other characters not found in Table 1 [64 alphabet characters plus padding character] must be ignored by decoding software." So this is a real bug, although I think a base-64 encoder that triggers it would be rare.
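
As a rough check of why this rarely bites in practice: base64.encodebytes wraps its output at 76 characters, a multiple of four, so its output round-trips through decode. The sketch below also shows one possible workaround for arbitrarily wrapped input that fits in memory; `decode_whole` is a hypothetical helper, not part of the base64 module.

```python
import base64
from io import BytesIO

payload = bytes(range(256)) * 2
encoded = base64.encodebytes(payload)

# Every line produced by encodebytes() is a multiple of four characters long.
line_lengths = [len(line) for line in encoded.splitlines()]
print(max(line_lengths), all(n % 4 == 0 for n in line_lengths))

# So line-by-line decoding of encodebytes() output works.
out = BytesIO()
base64.decode(BytesIO(encoded), out)
print(out.getvalue() == payload)


def decode_whole(src, dst):
    """Workaround: read the entire input and decode it in one call."""
    dst.write(base64.decodebytes(src.read()))


out = BytesIO()
decode_whole(BytesIO(b"MTIzND\nU2Nzg5\n"), out)
print(out.getvalue())  # b'123456789'
```

Reading everything at once gives up streaming, so it only suits inputs small enough to hold in memory.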