base64 module ignores non-alphabet characters

e8ebadac-a32c-4449-b78d-d0702bd98704 commented 18 years ago

BPO	1466065
Nosy	@amauryfa, @pitrou, @devdanzin, @merwok, @bitdancer, @Julian
Files	1466065[3.2-complete].diff: Unit-test/solution/documentation update for Python 3.2 1466065[2.7-complete].diff: Unit-test/solution/documentation update for Python 2.7 binascii-validate.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = 'https://github.com/bitdancer' closed_at = created_at = labels = ['type-bug', 'library'] title = 'base64 module ignores non-alphabet characters' updated_at = user = 'https://bugs.python.org/sanxiyn' ``` bugs.python.org fields: ```python activity = actor = 'Julian' assignee = 'r.david.murray' closed = True closed_date = closer = 'r.david.murray' components = ['Library (Lib)'] creation = creator = 'sanxiyn' dependencies = [] files = ['14600', '14601', '19548'] hgrepos = [] issue_num = 1466065 keywords = ['patch'] message_count = 33.0 messages = ['60903', '91023', '91024', '91025', '91039', '91043', '91045', '91046', '91047', '91048', '91049', '91050', '91052', '91053', '91054', '91055', '91056', '91057', '91060', '91061', '91062', '91064', '91065', '91067', '91068', '114650', '115687', '116049', '116072', '120803', '120966', '176409', '224090'] nosy_count = 9.0 nosy_names = ['amaury.forgeotdarc', 'sanxiyn', 'pitrou', 'ajaksu2', 'eric.araujo', 'r.david.murray', 'Red HamsterX', 'Julian', 'ltaczuk'] pr_nums = [] priority = 'normal' resolution = 'accepted' stage = 'resolved' status = 'closed' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue1466065' versions = ['Python 3.2'] ```

e8ebadac-a32c-4449-b78d-d0702bd98704 commented 18 years ago

See: http://mail.python.org/pipermail/python-dev/2006-April/063504.html

base64.b64decode function ignores non-alphabet characters, which contradicts its documentation: see http://docs.python.org/lib/module-base64.html

Also, the exception must be ValueError instead of TypeError.

158c8639-50e4-4e0c-bb99-86e636005c4a commented 15 years ago

Attached a patch that modifies Modules/binascii.c to raise a TypeError on any unrecognized base64 input, rather than failing silently. Also includes unit tests.

Patch built against Python 3.2, r74246.

TypeError was preserved over ValueError, despite the lack of related logic, for legacy purposes.

158c8639-50e4-4e0c-bb99-86e636005c4a commented 15 years ago

Attached a patch that modifies Lib/base64.py to raise a TypeError on any unrecognized base64 input, rather than failing silently. Also includes unit tests.

Patch built against Python 2.7, r74246.

158c8639-50e4-4e0c-bb99-86e636005c4a commented 15 years ago

Attached a patch that modifies Lib/base64.py to raise a TypeError on any unrecognized base64 input, rather than continuing silently. Also includes unit tests.

Patch built against Python 3.2, r74246.

Correction to previous text: base64 continues silently, discarding unrecognized characters, rather than failing.

amauryfa commented 15 years ago

The new unit tests pass without modifying the library. Could you include a case that fails with the current version?

158c8639-50e4-4e0c-bb99-86e636005c4a commented 15 years ago

I can't add a test for that without changing unrelated behaviour in the library that is already known to work fine or checking the string value of the raised exception, which seems like a bad idea, even though it would work.

If a character is ignored and this leads to a padding-length issue, TypeError is raised in both 2.7 and 3.2: 2.7 because everything is a TypeError, and 3.2 because binascii.Error is converted to TypeError for legacy purposes.

If a character is ignored and the string's length is still acceptable, then no error is reported because this was a silent problem.

Post-library-modification, both of these cases will uniformly produce the proper error, although it is, through type-checking alone, indistinguishable from the errors that would have existed before -- the value is in the fact that it will tell the user the nature of the failure, and it will be noisy when it may have been silent before.

bitdancer commented 15 years ago

If "it may be noisy where it was silent before", then add one of those cases and make sure the noise doesn't happen before your fix, and does happen after.

If you have to check the value of the exception string for other tests, then do so. There are plenty of examples of this in the existing tests, (see the pydoc tests, for example). If you can limit what you test for so that the test will be resitent to changes in the exact text, so much the better. You can use assertRaisesRegexp for this in 2.7.

amauryfa commented 15 years ago

What is the correct behavior for something like this? base64.b64decode('!') 2.6 silently returns ''.

158c8639-50e4-4e0c-bb99-86e636005c4a commented 15 years ago

According to the documentation cited by Seo Sanghyeon in the first post, "A TypeError is raised if s were incorrectly padded or if there are non-alphabet characters present in the string."

The valid range of characters is [A-Za-z0-9], and one or two '='s may appear at the end of the input to signify dimension-padding. Therefore, '!' should fail with a TypeError alerting the user to the presence of unrecognized data, rather than being discarded.

(Additionally, it looks like newline characters in the input aren't unheard of, and those are probably the reason behind the silent ignores)

158c8639-50e4-4e0c-bb99-86e636005c4a commented 15 years ago

R. David Murray, should I update the patches for both the pure-Python solution and the C solution, or is one domain preferable here? The Python-based solution keeps all of the invalid-character TypeErrors collected in the same module, but the C-based solution allows this problem to be caught more efficiently.

amauryfa commented 15 years ago

Therefore, '!' should fail with a TypeError Here is your test case! "Errors should never pass silently."

bitdancer commented 15 years ago

Amaury is probably better qualified to answer that question, but I would think the C code version is preferable.

amauryfa commented 15 years ago

I've dig into the history of the file and found this change: http://svn.python.org/view?view=rev&revision=13939 "- Illegal padding is now ignored. (Recommendation by GvR.)"

The motivation at the time was based on "the general Internet philosophy": http://mail.python.org/pipermail/python-bugs-list/1999-October/000234.html

I don't know if this is still valid 10 years later, though.

pitrou commented 15 years ago

Perhaps Guido remembers why the decision was made.

158c8639-50e4-4e0c-bb99-86e636005c4a commented 15 years ago

I'll hold off on uploading new patches until someone makes a decision, then.

It seems like, perhaps, simply amending the documentation would be sufficient, since this behaviour shouldn't break any valid messages that might reach this module. At worst, it'll just treat gibberish as valid, and that's what it's been doing for a decade. (Although the other decode routines are all strict by comparison)

gvanrossum commented 15 years ago

It strikes me as simply a documentation bug. Maybe whoever wrote the docs just assumed it would work that way. I expect that changing it to insist on valid input would break some use cases. If you want a validating API, perhaps a new API could be added that validates as well as converts.

bitdancer commented 15 years ago

It turns out that the RFC is unambiguous on this point. Quoting the base64 section of RFC 2045:

The encoded output stream must be represented in lines of no more than 76 characters each. All line breaks or other characters not found in Table 1 must be ignored by decoding software. In base64 data, characters other than those in Table 1, line breaks, and other white space probably indicate a transmission error, about which a warning message or even a message rejection might be appropriate under some circumstances.

Since "some circumstances" is not something the base64 decoder can decide, that has to be left to a higher level ap. So if unexpected characters are to generate an error, it would need to be enabled via a flag that defaults to not raising the error, IMO. Unless someone has a use case for rejecting an improperly encoded message, we should probably just fix the docs (perhaps noting that this behavior is in accordance with the RFC).

158c8639-50e4-4e0c-bb99-86e636005c4a commented 15 years ago

RFC 3548, referenced by the base64 module's docs, has a rather different statement on how invalid characters should be treated.

From 2.3 Interpretation of non-alphabet characters in encoded data: Implementations MUST reject the encoding if it contains characters outside the base alphabet when interpreting base encoded data, unless the specification referring to this document explicitly states otherwise. Such specifications may, as MIME does, instead state that characters outside the base encoding alphabet should simply be ignored when interpreting data ("be liberal in what you accept").

So it looks like we can safely just say that invalid characters are ignored in the docs, as long as it's explicit, but that's probably not what people will expect.

I'll add doc patches in a moment, and someone who's actually a developer (i.e., not me) can decide whether they're good enough.

158c8639-50e4-4e0c-bb99-86e636005c4a commented 15 years ago

Attached a documentation patch indicating that the ignore-invalid-characters behaviour is intentional, citing relevant RFCs for support.

Patch built against Python 3.2, r74261.

bitdancer commented 15 years ago

Hmm. But if the module is used outside of MIME (which it can be, and in fact is in the stdlib itself), then an error must be raised in order to comply with that RFC. So it sounds like we really ought to have that flag. And I was even wrong about the appropriate default.

158c8639-50e4-4e0c-bb99-86e636005c4a commented 15 years ago

It isn't written that only MIME may ignore such content. The key terms there are 'may' and 'explicitly states otherwise'.

If the documentation is clear, then all future application developers will know to check for validity using a regular expression like '^[A-Za-z0-9+/\r\n]+={0,2}$'. Any existing applications in which validity matters should already have a similar workaround.

While I do agree that standards are always good and that workarounds are bad, Guido does have a very valid point: "changing it to insist on valid input would break some use cases", and I think we already missed the 2.x -> 3.x window where that would have been acceptable.

pitrou commented 15 years ago

If the documentation is clear, then all future application developers will know to check for validity using a regular expression like '^[A-Za-z0-9+/\r\n]+={0,2}$'. Any existing applications in which validity matters should already have a similar workaround.

But having to validate input manually kinds of defeats the point of having a decoder in the stdlib, therefore I agree with MRAB that a validation flag would be useful.

bitdancer commented 15 years ago

And if the flag defaults to the current behavior that should satisfy the backward compatiblity issues.

158c8639-50e4-4e0c-bb99-86e636005c4a commented 15 years ago

Attached a documentation/unit-test/solution patch that adds a validate=False parameter to the b64decode function of Lib/base64.py, which may be set to True to have invalid base64 content be rejected with a TypeError.

Patch built against Python 3.2, r74261.

158c8639-50e4-4e0c-bb99-86e636005c4a commented 15 years ago

Attached a documentation/unit-test/solution patch that adds a validate=False parameter to the b64decode function of Lib/base64.py, which may be set to True to have invalid base64 content be rejected with a TypeError.

Patch built against Python 2.7, r74261.

Note: Sorry this went on for so long. However, I think I understand the patch-submission process a lot better now.

83d2e70e-e599-4a04-b820-3814bbdb9bef commented 14 years ago

Attached patches against 2.7 and 3.2 contain doc and unit test changes so can someone please review and commit if acceptable.

bitdancer commented 14 years ago

Comments on patch:

1) if I'm reading the RFC correctly, to be validating strictly in compliance with the RFC \r and \n should also raise an error. Do you agree?

2) We've pretty much dropped the convention of adding history notes to the file itself.

3) The original code documented a TypeError on incorrect padding but in py3k in fact raises a binascii.Error, as you noted in the patch. I wonder if it would be better to raise a binascii.Error on invalid data as well, since it isn't, strictly speaking, a TypeError. That would also make it easier to move the code into the C module later if that is deemed desirable.

4) The negative in the doc language ("If validate is not set to True...") is awkward and confusing. Better would be "If validate is False (the default)..."

Since the patch does add an API change (AKA a feature) I think this can only go into 3.2.

I have some additional concerns when considering this in the context of email6. email6 will have a 'strict' mode where it would be sensible for invalid base64 to raise an error. But in non-strict mode, it would be ideal to have a way to (a) know if there is invalid input, but still decode it, and (b) decode it even if the padding is off after ignoring the invalid data. I'm not saying that this patch should try to address those issues, I just want to put them on record in case I want to do something about them later.

158c8639-50e4-4e0c-bb99-86e636005c4a commented 14 years ago

I agree that it does look like RFC 3548 implicitly denies the use of \r and \n. A few *very* simple tests also makes it look like ommitting them breaks nothing, but we'd need to add a test that includes them before applying the patch.

Other than that, I agree with everything and find the email6 notes intruiging.

Would you like me to update the patch, or do you want to do it?

bitdancer commented 14 years ago

If you could update it that would be great.

bitdancer commented 13 years ago

Here is an updated patch that addresses the concerns I noted. I modified the tests: given that I've changed the code to raise binascii.Error as discussed, we don't really care from an API point of view what the error text is, just that the error is raised in the cases given. Nor do we care (again, from an API point of view) if the first error detected is the invalid character or the invalid padding.

I made the new test test the validate=False case, so all of the invalid character tests are now correctly padded if you ignore the invalid characters.

bitdancer commented 13 years ago

Committed in r86414.

e12aaf60-3019-4b2b-8d30-55706aa0918b commented 11 years ago

Could someone update the docs for python 2.7.3?

This ticket is marked as closed and b64decode still silently ignores non base64 chars, but the documentation (for 2.7.3) still states that while calling b64decode "A TypeError is raised if (...) or if there are non-alphabet characters present in the string.". http://docs.python.org/2.7/library/base64.html

8a0c69e7-f5ea-4ea3-8880-4903ed9740d7 commented 10 years ago

Created bpo-22088 to address not having fixed Py2 here.

python / cpython

base64 module ignores non-alphabet characters #43175