email._header_value_parser does not recognise in-line encoding changes

42624147-0ecb-4866-950c-a39f74db61aa commented 10 years ago

BPO	21315
Nosy	@warsaw, @bitdancer, @maxking, @miss-islington
PRs	python/cpython#13425 python/cpython#13846 python/cpython#15655
Files	000359.raw: Example bugzilla e-mail; unstructured_ew_without_whitespace.diff: Unit test & possible fix

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

```python
assignee = None
closed_at =
created_at =
labels = ['3.8', 'type-bug', '3.7', 'expert-email']
title = 'email._header_value_parser does not recognise in-line encoding changes'
updated_at =
user = 'https://bugs.python.org/valhallasw'
```

bugs.python.org fields:

```python
activity =
actor = 'miss-islington'
assignee = 'none'
closed = True
closed_date =
closer = 'maxking'
components = ['email']
creation =
creator = 'valhallasw'
dependencies = []
files = ['34984', '34985']
hgrepos = []
issue_num = 21315
keywords = ['patch']
message_count = 12.0
messages = ['216908', '238956', '342739', '342745', '342750', '342755', '342756', '343070', '343271', '344750', '344841', '351091']
nosy_count = 5.0
nosy_names = ['barry', 'r.david.murray', 'valhallasw', 'maxking', 'miss-islington']
pr_nums = ['13425', '13846', '15655']
priority = 'normal'
resolution = None
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue21315'
versions = ['Python 3.7', 'Python 3.8']
```

42624147-0ecb-4866-950c-a39f74db61aa commented 10 years ago

Bugzilla sends e-mail in a format where =?UTF-8 is not preceded by whitespace. This makes email.headerregistry.UnstructuredHeader (and email._header_value_parser in the background) fail to recognise the structure.

>>> import email.headerregistry, pprint
>>> x = {}; email.headerregistry.UnstructuredHeader.parse('[Bug 64155]\tNew:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\trussian text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', x); pprint.pprint(x)
{'decoded': '[Bug 64155]\tNew:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\t'
            'russian text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94',
 'parse_tree': UnstructuredTokenList([ValueTerminal('[Bug'), WhiteSpaceTerminal(' '), ValueTerminal('64155]'), WhiteSpaceTerminal('\t'), ValueTerminal('New:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;'), WhiteSpaceTerminal('\t'), ValueTerminal('russian'), WhiteSpaceTerminal(' '), ValueTerminal('text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94')])}

versus

>>> x = {}; email.headerregistry.UnstructuredHeader.parse('[Bug 64155]\tNew: =?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\trussian text: =?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', x); pprint.pprint(x)
{'decoded': '[Bug 64155]\tNew:  non-ascii bug tést;\trussian text:  АБВГҐД',
 'parse_tree': UnstructuredTokenList([ValueTerminal('[Bug'), WhiteSpaceTerminal(' '), ValueTerminal('64155]'), WhiteSpaceTerminal('\t'), ValueTerminal('New:'), WhiteSpaceTerminal(' '), EncodedWord([WhiteSpaceTerminal(' '), ValueTerminal('non-ascii'), WhiteSpaceTerminal(' '), ValueTerminal('bug'), WhiteSpaceTerminal(' '), ValueTerminal('tést')]), ValueTerminal(';'), WhiteSpaceTerminal('\t'), ValueTerminal('russian'), WhiteSpaceTerminal(' '), ValueTerminal('text:'), WhiteSpaceTerminal(' '), EncodedWord([WhiteSpaceTerminal(' '), ValueTerminal('АБВГҐД')])])}

I have attached the raw e-mail.

Judging by the code, this is supposed to work (while raising a Defect -- "missing whitespace before encoded word"), but the code splits by whitespace:

tok, *remainder = _wsp_splitter(value, 1)

which swallows the encoded section in one go. In a second attachment, I added a patch which 1) adds a test case for this and 2) implements a solution, but the solution is unfortunately not in the style of the rest of the module.
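To make the splitting behaviour concrete, here is a minimal sketch. It does not call the module's private `_wsp_splitter`; instead it assumes a stand-in (`split_on_wsp`, a name made up for this example) that splits on runs of spaces and tabs and keeps the separator, which is how the line above appears to behave.

```python
import re

# Stand-in for the module's private whitespace splitter. This is an
# assumption for illustration: split on runs of space/tab, keep the separator.
split_on_wsp = re.compile(r'([ \t]+)').split

value = 'New:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\trussian text'

tok, *remainder = split_on_wsp(value, maxsplit=1)

# The first token runs up to the first whitespace, so the encoded word stays
# glued to the preceding "New:" and never appears at the start of a token.
print(repr(tok))
print(tok.startswith('=?'))
```

Because the token does not start with `=?`, a check for an encoded word at the start of the token would not fire for it.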

In the meantime, I've chosen a monkey-patching approach to work around the issue:

import email._header_value_parser, email.headerregistry
def get_unstructured(value):
    value = value.replace("=?UTF-8?Q?=20", " =?UTF-8?Q?")
    return email._header_value_parser.get_unstructured(value)
email.headerregistry.UnstructuredHeader.value_parser = staticmethod(get_unstructured)

83d2e70e-e599-4a04-b820-3814bbdb9bef commented 9 years ago

Could someone formally review the patch, please? It's only three additional lines of code and a new test.

maxking commented 5 years ago

According to RFC 2047, section 5 (1):

However, an 'encoded-word' that appears in a header field defined as '*text' MUST be separated from any adjacent 'encoded-word' or 'text' by 'linear-white-space'.

So, it seems like splitting on whitespace is the right thing to do (see MUST).

While your solution works for your case, where the charset and cte are utf-8 and q respectively (not the general case of arbitrary charsets and cte), it seems like a hack to get around the fact that the header does not conform to the RFC.

IMO manipulating the original header (value.replace in your patch) isn't something we should do, but @r.david.murray would be the right person to answer how we handle non-conformant messages.

bitdancer commented 5 years ago

A cleaner/safer solution here would be:

    tok, *remainder = _wsp_splitter(value, 1)
    if _rfc2047_matcher(tok):
        tok, *remainder = value.partition('=?')

where _rfc2047_matcher would be a regex that matches a correctly formatted encoded word. There is a regex for that in the header.py module, though for this application we don't need the groups it has.
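A rough, self-contained sketch of that idea follows. It is not the code from any patch or PR. The names `split_on_wsp`, `encoded_word_re` and `split_before_encoded_word` are hypothetical, and the regex is a simplified guess at "a correctly formatted encoded word" (charset, then `b` or `q`, then the encoded text), not the one in header.py.

```python
import re

# Hypothetical stand-ins; neither name exists in the email package.
split_on_wsp = re.compile(r'([ \t]+)').split
# Simplified encoded-word pattern: =?charset?b-or-q?encoded-text?=
encoded_word_re = re.compile(r'=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=')


def split_before_encoded_word(value):
    """Return (first_token, rest), breaking before an embedded encoded word."""
    tok, *remainder = split_on_wsp(value, maxsplit=1)
    # Only re-split when a well-formed encoded word sits inside the token
    # rather than at its start.
    if encoded_word_re.search(tok) and not tok.startswith('=?'):
        tok, sep, rest = value.partition('=?')
        return tok, sep + rest
    return tok, ''.join(remainder)


print(split_before_encoded_word('New:=?UTF-8?Q?=20t=C3=A9st?=;\tmore'))
print(split_before_encoded_word('plain text'))
```

Requiring a match for the whole encoded word, rather than just the `=?` prefix, means a stray `=?` in ordinary text is left alone.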

Abhilash, I'm not sure why you say the proposed solution only works for utf-8 and 'q'?

maxking commented 5 years ago

The solution replaces the RFC 2047 chrome for utf-8 and q to make sure there is a space before the ew; it wouldn't replace anything for any other charset/cte pair.

    value = value.replace("=?UTF-8?Q?=20", " =?UTF-8?Q?")

Isn't that correct?
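As a quick illustration of that limitation, the sketch below wraps the same literal replacement in a helper (`add_missing_space`, a name invented here) and applies it to a few made-up header values.

```python
def add_missing_space(value):
    # Same literal replacement as the line quoted above.
    return value.replace("=?UTF-8?Q?=20", " =?UTF-8?Q?")


# Illustrative header fragments, not taken from a real message.
utf8_q = 'New:=?UTF-8?Q?=20t=C3=A9st?='
latin1_b = 'New:=?ISO-8859-1?B?IHTpc3Q=?='
lowercase = 'New:=?utf-8?q?=20t=C3=A9st?='

print(add_missing_space(utf8_q))                   # space inserted before the encoded word
print(add_missing_space(latin1_b) == latin1_b)     # different charset/cte: unchanged
print(add_missing_space(lowercase) == lowercase)   # different case: unchanged
```

Only the exact upper-case `=?UTF-8?Q?=20` prefix is rewritten; every other encoded word passes through untouched.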

bitdancer commented 5 years ago

I don't see that line of code in unstructured_ew_without_whitespace.diff.

Oh, you are referring to his monkey patch. Yes, that is not a suitable solution for anyone but him, and I don't think he meant to imply otherwise :)

maxking commented 5 years ago

Ah, I wrongly assumed the patch had the same thing.

Sorry about that.

maxking commented 5 years ago

Created a Pull Request for this.

https://github.com/python/cpython/pull/13425

maxking commented 5 years ago

I have made the requested changes on the PR.

warsaw commented 5 years ago

New changeset 66c4f3f38b867d8329b28c032bb907fd1a2f22d2 by Barry Warsaw (Abhilash Raj) in branch 'master': bpo-21315: Fix parsing of encoded words with missing leading ws. (GH-13425) https://github.com/python/cpython/commit/66c4f3f38b867d8329b28c032bb907fd1a2f22d2

warsaw commented 5 years ago

New changeset dc20fc4311dece19488299a7cd11317ffbe4d3c3 by Barry Warsaw (Miss Islington (bot)) in branch '3.7': bpo-21315: Fix parsing of encoded words with missing leading ws. (GH-13425) (GH-13846) https://github.com/python/cpython/commit/dc20fc4311dece19488299a7cd11317ffbe4d3c3

miss-islington commented 5 years ago

New changeset 59e8fba7189d0e86d428a1125744afb8b0f40b5d by Miss Islington (bot) (Ashwin Ramaswami) in branch '3.8': [3.8] bpo-21315: Fix parsing of encoded words with missing leading ws (GH-13425) (GH-15655) https://github.com/python/cpython/commit/59e8fba7189d0e86d428a1125744afb8b0f40b5d

email._header_value_parser does not recognise in-line encoding changes #65514