email.parser / email.policy does not correctly handle multiple RFC2047 encoded-word tokens across RFC5322 folded headers

652782da-b5f3-46b2-ae1b-1f73662bb759 commented 5 years ago

BPO	35547
Nosy	@warsaw, @mjpieters, @bitdancer, @websurfer5

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['type-bug', '3.7', 'expert-email'] title = 'email.parser / email.policy does not correctly handle multiple RFC2047 encoded-word tokens across RFC5322 folded headers' updated_at = user = 'https://github.com/mjpieters' ``` bugs.python.org fields: ```python activity = actor = 'Jeffrey.Kintscher' assignee = 'none' closed = False closed_date = None closer = None components = ['email'] creation = creator = 'mjpieters' dependencies = [] files = [] hgrepos = [] issue_num = 35547 keywords = [] message_count = 6.0 messages = ['332243', '332276', '332277', '332279', '332290', '332296'] nosy_count = 5.0 nosy_names = ['barry', 'mjpieters', 'r.david.murray', 'era', 'Jeffrey.Kintscher'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue35547' versions = ['Python 3.7'] ```

652782da-b5f3-46b2-ae1b-1f73662bb759 commented 5 years ago

The From header in the following email headers is not correctly decoded; both the subject and from headers contain UTF-8 encoded data encoded with RFC2047 encoded-words, in both cases a multi-byte UTF-8 codepoint has been split between the two encoded-word tokens:

>>> msgdata = '''\
From: =?utf-8?b?4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo=?=
 =?utf-8?b?seGbiw==?= <martijn@example.com>
Subject: =?utf-8?b?c8qHdcSxb2THnXBvyZQgOC3ihLLiiqXiiKkgx53Kh8qOcS3E?=
 =?utf-8?b?scqHyoNuya8gyaXKh8Sxyo0gx53Gg8mQc3PHncmvIMqHc8edyocgybnHncaDdW/Kgw==?=
'''
>>> from io import StringIO
>>> from email.parser import Parser
>>> from email import policy
>>> msg = Parser(policy=policy.default).parse(StringIO(msgdata))
>>> print(msg['Subject'])  # correct
sʇuıodǝpoɔ 8-Ⅎ⊥∩ ǝʇʎq-ıʇʃnɯ ɥʇıʍ ǝƃɐssǝɯ ʇsǝʇ ɹǝƃuoʃ
>>> print(msg['From'])  # incorrect
ᛗᚫᚱᛏᛁᛃᚾᛈᛁᛖᛏᛖ� �ᛋ <martijn@example.com>

Note the two FFFD placeholders in the From line.

The issue is that the raw value of the From and Subject contain the folding space at the start of the continuation lines:

>>> for name, value in msg.raw_items():
...     if name in {'Subject', 'From'}:
...         print(name, repr(value))
...
From '=?utf-8?b?4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo=?=\n =?utf-8?b?seGbiw==?= <martijn@example.com>'
Subject '=?utf-8?b?c8qHdcSxb2THnXBvyZQgOC3ihLLiiqXiiKkgx53Kh8qOcS3E?=\n =?utf-8?b?scqHyoNuya8gyaXKh8Sxyo0gx53Gg8mQc3PHncmvIMqHc8edyocgybnHncaDdW/Kgw==?='

For the Subject header, _header_value_parser.get_unstructured is used, which *expects* there to be spaces between encoded words; it inserts EWWhiteSpaceTerminal tokens in between which are turned into empty strings. But for the From header, AddressHeader parser does not, the space at the start of the line is retained, and the surrogate escapes at the end of one encoded-word and the start start of the next encoded-word never ajoin, so the later handling of turning surrogates back into proper data fails.

Since unstructured header parsing doesn't mind if a space is missing between encoded-word atoms, the work-around is to explicitly remove the space at the start of every line; this can be done in a custom policy:

import re
from email.policy import EmailPolicy

class UnfoldingHeaderEmailPolicy(EmailPolicy):
    def header_fetch_parse(self, name, value):
        # remove any leading whitespace from header lines
        # before further processing
        value = re.sub(r'(?<=[\n\r])([\t ])', '', value)
        return super().header_fetch_parse(name, value)

custom_policy = UnfoldingHeaderEmailPolicy()

after which the From header comes out without placeholders:

>>> msg = Parser(policy=custom_policy).parse(StringIO(msgdata))
>>> msg['from']
'ᛗᚫᚱᛏᛁᛃᚾᛈᛁᛖᛏᛖᚱᛋ <martijn@example.com>'
>>> msg['subject']
'sʇuıodǝpoɔ 8-Ⅎ⊥∩ ǝʇʎq-ıʇʃnɯ ɥʇıʍ ǝƃɐssǝɯ ʇsǝʇ ɹǝƃuoʃ'

This issue was found by way of https://stackoverflow.com/q/53868584/100297

652782da-b5f3-46b2-ae1b-1f73662bb759 commented 5 years ago

Right, re-educating myself on the MIME RFCs, and found https://bugs.python.org/issue1372770 where the same issue is being discussed for previous incarnations of the email library.

Removing the FWS after CRLF is the wrong thing to do, **unless** RFC2047 separating encoded-word tokens. The work-around regex is a bit more complicated, but ideally the EW handling should use a specialist FWS token to delimit encoded-word sections that renders to '' as is done in unstructured headers, but everywhere. Because in practice, there are email clients out there that use EW in structured headers, regardless.

Regex to work around this

# crude CRLF-FWS-between-encoded-word matching
value = re.sub(r'(?<=\?=(\r\n|\n|\r))([\t ]+)(?==\?)', '', value)

652782da-b5f3-46b2-ae1b-1f73662bb759 commented 5 years ago

That regex is incorrect, I should not post untested code from a mobile phone. Corrected workaround with more context:

import re
from email.policy import EmailPolicy

class UnfoldingEncodedStringHeaderPolicy(EmailPolicy):
    def header_fetch_parse(self, name, value):
        # remove any leading whitespace from header lines
        # that separates apparent encoded-word token before further processing 
        # using somewhat crude CRLF-FWS-between-encoded-word matching
        value = re.sub(r'(?<=\?=)((?:\r\n|[\r\n])[\t ]+)(?==\?)', '', value)
        return super().header_fetch_parse(name, value)

4e71a086-d468-4333-b7b0-fbd5381908f7 commented 5 years ago

I don't think this is a bug. My impression is that encoded words should be decodable in isolation.

652782da-b5f3-46b2-ae1b-1f73662bb759 commented 5 years ago

While RFC2047 clearly states that an encoder MUST not split multi-byte encodings in the middle of a character (section 5, "Each 'encoded-word' MUST represent an integral number of characters. A multi-octet character may not be split across adjacent 'encoded-word's.), it also states that to fit length restrictions, CRLF SPACE is used as a delimiter between encoded words (section 2, "If it is desirable to encode more text than will fit in an 'encoded-word' of 75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may be used."). In section 6.2 it states

When displaying a particular header field that contains multiple 'encoded-word's, any 'linear-white-space' that separates a pair of adjacent 'encoded-word's is ignored. (This is to allow the use of multiple 'encoded-word's to represent long strings of unencoded text, without having to separate 'encoded-word's where spaces occur in the unencoded text.)

(linear-white-space is the RFC822 term for foldable whitespace).

The parser is leaving spaces between two encoded-word tokens in place, where it must remove them instead. And it is doing so correctly for unstructured headers, just not in get_bare_quoted_string, get_atom and get_dot_atom.

Then there is Postel's law (be liberal in what you accept from others), and the email package already applies that principle to RFC2047 elsewhere; RFC2047 also states that "An 'encoded-word' MUST NOT appear within a 'quoted-string'." yet email._header_value_parser's handling of quoted-string will process EW sections.

bitdancer commented 5 years ago

Here's a patch that makes the example work correctly. This is not a fix, a real fix will be more complicated. This just demonstrates the kind of thing that needs fixing and where.

The existing parser produces a sub-optimal parse tree as its result...the parse tree is hard to inspect and manipulate because there are so many special cases. A good fix here would create some sort of function that could be passed an existing TokenList, the new token to add to that list, and the function would check all the special cases and do the EWWhiteSpaceTerminal substitution when and as appropriate. This could then be used in the unstructured parser as well as Phrase...and some thought should be given to where else it might be needed. It has been long enough since I've held the RFCs in my head that I don't remember if there is anywhere else.

I haven't looked at the actual character string, so I don't know if we need to also be detecting and posting a defect about a split character or not, but we don't *have* to answer that question to fix this.

diff --git a/Lib/email/_header_value_parser.py b/Lib/email/_header_value_parser.py
index e805a75..d5d5986 100644
--- a/Lib/email/_header_value_parser.py
+++ b/Lib/email/_header_value_parser.py
@@ -199,6 +199,10 @@ class CFWSList(WhiteSpaceTokenList):

 class Atom(TokenList):

+    @property
+    def has_encoded_word(self):
+        return any(t.token_type=='encoded-word' for t in self)
+
     token_type = 'atom'

@@ -1382,6 +1386,12 @@ def get_phrase(value):
                         "comment found without atom"))
                 else:
                     raise
+            if token.has_encoded_word:
+                assert phrase[-1].token_type == 'atom', phrase[-1]
+                assert phrase[-1][-1].token_type == 'cfws'
+                assert phrase[-1][-1][-1].token_type == 'fws'
+                if phrase[-1].has_encoded_word:
+                    phrase[-1][-1] = EWWhiteSpaceTerminal(phrase[-1][-1][-1], 'fws')
             phrase.append(token)
     return phrase, value

fsc-eriker commented 8 months ago

This seems vaguely related to #109252

python / cpython

email.parser / email.policy does not correctly handle multiple RFC2047 encoded-word tokens across RFC5322 folded headers #79728