python / cpython

The Python programming language
https://www.python.org
Other
63.42k stars 30.37k forks source link

Email.quopriprime over-encodes characters #76479

Closed 42c26b8c-c27b-44c5-98bb-c1d848128f65 closed 6 years ago

42c26b8c-c27b-44c5-98bb-c1d848128f65 commented 6 years ago
BPO 32298
Nosy @bitdancer

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields: ```python assignee = None closed_at = created_at = labels = ['invalid', 'type-bug', 'library'] title = 'Email.quopriprime over-encodes characters' updated_at = user = 'https://bugs.python.org/gkuenning' ``` bugs.python.org fields: ```python activity = actor = 'gkuenning' assignee = 'none' closed = True closed_date = closer = 'r.david.murray' components = ['Library (Lib)'] creation = creator = 'gkuenning' dependencies = [] files = [] hgrepos = [] issue_num = 32298 keywords = [] message_count = 5.0 messages = ['308181', '308184', '308186', '308187', '308267'] nosy_count = 2.0 nosy_names = ['r.david.murray', 'gkuenning'] pr_nums = [] priority = 'normal' resolution = 'not a bug' stage = 'resolved' status = 'closed' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue32298' versions = ['Python 3.4'] ```

42c26b8c-c27b-44c5-98bb-c1d848128f65 commented 6 years ago

Email.quopriprime creates a map of header and body bytes that need no encoding:

for c in b'-!*+/' + ascii_letters.encode('ascii') + digits.encode('ascii'):
    _QUOPRI_HEADER_MAP[c] = chr(c)

This map is overly restrictive; in fact only two printable characters need to be omitted: the space and the equals sign. The following revision to the loop creates a correct table:

for c in list(range(33, 61)) + list(range(62, 127)):
    _QUOPRI_HEADER_MAP[c] = chr(c)

Why does this matter? Well, first, it's wasteful since it creates messages with larger headers than necessary. But more important, it makes it impossible for other tools to operate on the messages unless they're encoding aware; for example, one can't easily grep for "foo@bar.com" because the at sign is encoded as =40.

42c26b8c-c27b-44c5-98bb-c1d848128f65 commented 6 years ago

Oops, that loop is a bit too generous. Here's a better one:

for c in list(range(33, 61)) + [62] + list(range(64, 95)) + list(range(96,127)):
bitdancer commented 6 years ago

From RFC 2047:

(3) As a replacement for a 'word' entity within a 'phrase', for example, one that precedes an address in a From, To, or Cc header. The ABNF definition for 'phrase' from RFC 822 thus becomes:

    phrase = 1*( encoded-word / word )
In this case the set of characters that may be used in a "Q"-encoded
'encoded-word' is restricted to: <upper and lower case ASCII
letters, decimal digits, "!", "*", "+", "-", "/", "=", and "_"
(underscore, ASCII 95.)>.  An 'encoded-word' that appears within a
'phrase' MUST be separated from any adjacent 'word', 'text' or
'special' by 'linear-white-space'.

The reason for this is that things like '@' are syntactically significant in headers and so must be encoded.

bitdancer commented 6 years ago

And of course tools can grep for "foo@bar.com": you can't use encoded words in an address, only in the display name.

However, it occurs to me that in fact the restriction applies only to phrases, so one could use a less restrictive character set in an unstructured header such as the Subject, and that would indeed be nice. The old header folder (python 2.7 and python 3.x compat32 policy) can't do it, because they don't know anything about the syntax of the headers they fold, they just use a bunch of heuristics. The new policies in python3, however, use a smarter folder from _header_value_parser, and that *does* have access to the full parse tree for the header, and so could make smart decisions about which character set to use for the encoded word encoding.

If you'd like to try your hand at a PR implementing this idea, I'll be happy to provide advice and do a review. It's not going to be anywhere near as simple as the one line change you proposed here, though :)

42c26b8c-c27b-44c5-98bb-c1d848128f65 commented 6 years ago

I should have read that part of RFC 2047 before I submitted.

I'd love to claim that I'm going to write a patch that would do as you suggest. But the reality is that I'm unlikely to find the time, so I'm going to be wise for once and avoid promising what I can't deliver.