Inconsistency between quopri.decodestring() and email.quoprimime.decode()

serhiy-storchaka commented 11 years ago

BPO	18022
Nosy	@gvanrossum, @warsaw, @bitdancer, @jeremyhylton, @vadmium, @serhiy-storchaka

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['3.7', 'type-bug', 'library', 'expert-email'] title = 'Inconsistency between quopri.decodestring() and email.quoprimime.decode()' updated_at = user = 'https://github.com/serhiy-storchaka' ``` bugs.python.org fields: ```python activity = actor = 'r.david.murray' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)', 'email'] creation = creator = 'serhiy.storchaka' dependencies = [] files = [] hgrepos = [] issue_num = 18022 keywords = [] message_count = 12.0 messages = ['189663', '190874', '190876', '190889', '197665', '197715', '290521', '290525', '290533', '316563', '316622', '316642'] nosy_count = 6.0 nosy_names = ['gvanrossum', 'barry', 'r.david.murray', 'Jeremy.Hylton', 'martin.panter', 'serhiy.storchaka'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue18022' versions = ['Python 2.7', 'Python 3.5', 'Python 3.6', 'Python 3.7'] ```

serhiy-storchaka commented 11 years ago

>>> import quopri, email.quoprimime
>>> quopri.decodestring(b'==41')
b'=41'
>>> email.quoprimime.decode('==41')
'=A'

I don't see a rule about double '=' in RFC 1521-1522 or RFCs 2045-2047 and I think quopri is wrong.

Other half of this bug (encoding '=' as '==') was fixed in 9bc52706d283.

serhiy-storchaka commented 11 years ago

There are other inconsistencies. email.quoprimime.decode(), binascii.a2b_qp() and pure Python (by default binascii used) quopri.decodestring() returns different results for following data:

       quoprimime  binascii  quopri

b'=' '' b'' b'=' b'==' '=' b'=' b'==' b'= ' '' b'= ' b'= ' b'= \n' '' b'= \n' b'' b'=\r' '' b'' b'=\r' b'==41' '=A' b'=41' b'=A'

bitdancer commented 11 years ago

Most of the variations represent different invalid-input recovery choices. I believe binascii's decoding of b'= \n' is incorrect, as is its decoding of b'==41'. quopri's decoding of b'=\r' is arguably incorrect as well, given that python generally supports universal line ends. Otherwise the decodings are all responses to erroneous input for which the behavior is not specified.

That said, we ought to pick one error recovery scheme and implement it in all places, and IMO it shouldn't be exactly any of the ones we've got. Or better yet, use one common implementation. Untangling quopri is on my (too large) List of Things To Do :)

serhiy-storchaka commented 11 years ago

Perl's MIME::QuotedPrint produces same result as pure Python quopri. konwert qp-8bit produces same result as binascii (except '==41' it decodes as '=A').

RFC 2045 says:

"""A reasonable approach by a robust implementation might be to include the "=" character and the following character in the decoded data without any transformation and, if possible, indicate to the user that proper decoding was not possible at this point in the data. """

serhiy-storchaka commented 11 years ago

So what scheme we will picked?

bitdancer commented 11 years ago

As I said, not exactly any of the above.

I'll get back to this after I finish the new email code (which should happen before the end of the month). I need to take some time to look over the RFCs and real world examples and come up with the most appropriate rules.

serhiy-storchaka commented 7 years ago

Ping.

vadmium commented 7 years ago

The double equals "==" case for the “quopri” implementation in Python is now consistent with the others thanks to the fix in bpo-23681 (see also bpo-21511).

According to bpo-20121, the quopri (Python) implementation only supports LF (\n) characters as line breaks, and the binascii (C) implementation also supports CRLF. So I agree that the whitespace-before-newline case "= \n" is a genuine bug (see bpo-16473). But the CR case "=\r" is not supported because neither quopri nor binascii support universal newlines or CR line breaks on their own.

serhiy-storchaka commented 7 years ago

Thus currently the table of discrepancies looks as:

       quoprimime  binascii  quopri

b'=' '' b'' b'=' b'= ' '' b'= ' b'= ' b'= \n' '' b'= \n' b'' b'=\r' '' b'' b'=\r' b'==41' '=A' b'=41' b'=41'

bitdancer commented 6 years ago

OK, I've finally gotten around to looking at this. It looks like quopri and binascii are not stripping trailing whitespace.

          quoprimime  binascii     quopri       preferred

b'=' '' b'' b'=' '=' b'= ' '' b'= ' b'= ' '=' b'= \n' '' b'= \n' b'' quoprimime binascii quopri

b'=' '' b'' b'=' b'= ' '' b'= ' b'= ' b'= \n' '' b'= \n' b'' b'=\r' '' b'' b'=\r' b'==41' '=A' b'=41' b'=41' '=\n' b'=\r' '' b'' b'=\r' '=\r' b'==41' '=A' b'=41' b'=41' '=A' b'= \n f\n' ' f\n' b'= \n f\n' b'= \n f\n' ' f\n'

The RFC recommends that a trailing = be preserved, but that trailing whitespace be ignored. It doesn't speak directly to the ==41 case, but one can infer that the first = in the == pair is most likely to have "come from the source text" and not been encoded, while the =41 was an intentional encoding and so should be decoded.

Now, that said, the actual behavior that our libraries have had for a long time is to treat the "last line" just like all other lines, and strip a trailing =. So I would be inclined to keep that behavior for backward compatibility reasons rather than change it to be more RFC compliant, given that we don't have any actual bug report related to it, and "fixing" it could break things. Given that, the current quoprimime behavior becomes the reference.

However, backward compatibility concerns also arise around starting to strip trailing space in quopri and binascii. Maybe we only make that change in 3.8?

serhiy-storchaka commented 6 years ago

Many thanks David! But sorry, your table confused me. I can't read it. Could you please reformat it?

bitdancer commented 6 years ago

I should have just deleted the table, actually.

The only important info in it is that per RFC '=', '=\n', and '= \n' all ought to become '='. But I don't think we should make that change, I think we should continue to turn those into ''. So I consider the (current!) bwehavior of quoprimime to be the correct behavior.

I also gave the example of '= \n foo\n', to show that quopri and binascii aren't stripping trailing blanks, as Martin noted in the other issue. They fold lines if they see '=\n', but not if they see '= \n', which is wrong per the (email!) RFC. I'm not clear if it is wrong for non-email uses of quopric, I haven't tried to research that.

python / cpython

Inconsistency between quopri.decodestring() and email.quoprimime.decode() #62222