python / cpython


email parser fails to decode quoted-printable rfc822 message attachemnt #89229

Open 25ef0b57-2469-41bc-9908-c38e731e7b12 opened 3 years ago

25ef0b57-2469-41bc-9908-c38e731e7b12 commented 3 years ago
BPO 45066
Nosy @warsaw, @bitdancer, @DiddiLeija, @anarcat

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at = None
created_at =
labels = ['expert-email', 'type-crash', '3.9']
title = 'email parser fails to decode quoted-printable rfc822 message attachemnt'
updated_at =
user = 'https://github.com/anarcat'
```

bugs.python.org fields:

```python
activity =
actor = 'DiddiLeija'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['email']
creation =
creator = 'anarcat'
dependencies = []
files = []
hgrepos = []
issue_num = 45066
keywords = []
message_count = 2.0
messages = ['400764', '400767']
nosy_count = 4.0
nosy_names = ['barry', 'r.david.murray', 'DiddiLeija', 'anarcat']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'crash'
url = 'https://bugs.python.org/issue45066'
versions = ['Python 3.9']
```

25ef0b57-2469-41bc-9908-c38e731e7b12 commented 3 years ago

If an email message has a message/rfc822 part *and* that part is quoted-printable encoded, Python freaks out.

Consider this code:

import email.parser
import email.policy

# python 3.9.2 cannot decode this message, it fails with
# "email.errors.StartBoundaryNotFoundDefect"

mail = """Mime-Version: 1.0
Content-Type: multipart/report;
 boundary=aaaaaa
Content-Transfer-Encoding: 7bit

--aaaaaa
Content-Type: message/rfc822
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=3D"=3Dbbbbbb"

--=3Dbbbbbb
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=3Dutf-8

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=
x

--=3Dbbbbbb--

--aaaaaa--
"""

msg_abuse = email.parser.Parser(policy=email.policy.default + email.policy.strict).parsestr(mail)

That crashes with: email.errors.StartBoundaryNotFoundDefect

This should normally work: the sub-message is valid, assuming you decode the content. But if you do not, you end up in this bizarre situation, because the multipart boundary is probably considered to be something like 3D"=3Dbbbbbb", and of course the above code crashes with the above exception.
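As a quick check of that reading, the standard-library `quopri` module can decode the inner header and delimiter lines by hand. This only illustrates what the encoded lines mean once decoded; it is not something the parser does, and the variable names are made up for the example.

```python
import quopri

# Header and delimiter lines of the inner message, exactly as they appear
# (quoted-printable encoded) in the failing example.
encoded_header = b'Content-Type: multipart/alternative; boundary=3D"=3Dbbbbbb"'
encoded_delimiter = b"--=3Dbbbbbb"

# "=3D" is the quoted-printable escape for a literal "=".
decoded_header = quopri.decodestring(encoded_header)
decoded_delimiter = quopri.decodestring(encoded_delimiter)

print(decoded_header.decode("ascii"))
print(decoded_delimiter.decode("ascii"))
```

Once decoded, the declared boundary and the delimiter line agree again, which is why the sub-message is valid after decoding.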

If you remove the quoted-printable part from the equation, the parser actually behaves:

import email.parser
import email.policy

# python 3.9.2 parses this variant without raising
# "email.errors.StartBoundaryNotFoundDefect"

mail = """Mime-Version: 1.0
Content-Type: multipart/report;
 boundary=aaaaaa
Content-Transfer-Encoding: 7bit

--aaaaaa
Content-Type: message/rfc822
Content-Disposition: inline

MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="=bbbbbb"

--=bbbbbb
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=utf-8

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

--=bbbbbb--

--aaaaaa--
"""

msg_abuse = email.parser.Parser(policy=email.policy.default + email.policy.strict).parsestr(mail)

The above correctly parses the message.

This problem causes all sorts of weird issues. In one real-world example, the parser simply stopped parsing headers inside the email, because long header lines (typical of Received headers) were broken up by the encoding. So it did not actually fail completely. Or, to be more accurate: by *default* (i.e. if you do not use strict), it does not crash and instead produces invalid data (e.g. a message without a Message-ID or From).

On most messages that are encoded this way, the strict mode will actually fail with: email.errors.MissingHeaderBodySeparatorDefect because it will stumble upon a header line that should be a continuation but instead is treated like a full header line, so it's missing a colon (":").
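To see what the non-strict parser records instead of raising, one can walk the parsed tree and collect each part's `defects` list. The sketch below rebuilds the failing message from this report line by line (the exact length of the `x` run is an assumption and should not matter); `collect_defects` is a hypothetical helper, not part of the `email` package.

```python
import email
import email.policy

# The failing message from this report, rebuilt line by line.
MAIL = "\n".join([
    "Mime-Version: 1.0",
    "Content-Type: multipart/report;",
    " boundary=aaaaaa",
    "Content-Transfer-Encoding: 7bit",
    "",
    "--aaaaaa",
    "Content-Type: message/rfc822",
    "Content-Transfer-Encoding: quoted-printable",
    "Content-Disposition: inline",
    "",
    "MIME-Version: 1.0",
    'Content-Type: multipart/alternative; boundary=3D"=3Dbbbbbb"',
    "",
    "--=3Dbbbbbb",
    "Content-Transfer-Encoding: 8bit",
    "Content-Type: text/plain; charset=3Dutf-8",
    "",
    "x" * 73 + "=",  # long line ending in a quoted-printable soft break
    "x",
    "",
    "--=3Dbbbbbb--",
    "",
    "--aaaaaa--",
    "",
])


def collect_defects(msg):
    """Return (content type, defect class name) pairs for every part."""
    found = []
    for part in msg.walk():
        for defect in part.defects:
            found.append((part.get_content_type(), type(defect).__name__))
    return found


# With the default (non-strict) policy, defects are recorded on the parts
# instead of being raised as exceptions.
msg = email.message_from_string(MAIL, policy=email.policy.default)
for content_type, name in collect_defects(msg):
    print(content_type, name)
```

Checking `defects` after a non-strict parse is one way to notice that the result is unreliable without switching the whole parse to strict mode.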

25ef0b57-2469-41bc-9908-c38e731e7b12 commented 3 years ago

looking at email.feedparser.FeedParser._parse_gen(), it looks like this is going to be really hard to fix, because the parser just happily recurses into the sub-part without ever checking the CTE (content-transfer-encoding). that's typically only done on "get_payload()", which is obviously not called there because we're streaming the email in.

in general, it looks like support for quoted-printable, as a CTE (which is https://datatracker.ietf.org/doc/html/rfc2045#section-6.7), seems to be spotty at best. multipart/ parts will raise the (undocumented) exception InvalidMultipartContentTransferEncodingDefect if they encounter it, for example:

https://github.com/python/cpython/blob/3.9/Lib/email/feedparser.py#L322

so I'm not sure how to handle this. it's not clear to me either how to work around this problem at all... is there a way to keep the parser from recursing like this?
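One possible workaround, sketched here under loud assumptions, is to relabel `message/rfc822` parts before parsing so the parser treats them as opaque leaves, then decode and parse them separately. The placeholder content type and the helper name are hypothetical, and the regular expression is crude: it would also rewrite a matching line inside a body. The `email` package offers nothing like this as a documented feature.

```python
import email
import email.policy
import re

# Hypothetical placeholder type: any type that is neither message/* nor
# multipart/* makes the parser keep the part as an undecoded leaf.
PLACEHOLDER_TYPE = "application/x-encoded-rfc822"

# The failing message from this report (the length of the x run is assumed).
MAIL = "\n".join([
    "Mime-Version: 1.0",
    "Content-Type: multipart/report;",
    " boundary=aaaaaa",
    "Content-Transfer-Encoding: 7bit",
    "",
    "--aaaaaa",
    "Content-Type: message/rfc822",
    "Content-Transfer-Encoding: quoted-printable",
    "Content-Disposition: inline",
    "",
    "MIME-Version: 1.0",
    'Content-Type: multipart/alternative; boundary=3D"=3Dbbbbbb"',
    "",
    "--=3Dbbbbbb",
    "Content-Transfer-Encoding: 8bit",
    "Content-Type: text/plain; charset=3Dutf-8",
    "",
    "x" * 73 + "=",
    "x",
    "",
    "--=3Dbbbbbb--",
    "",
    "--aaaaaa--",
    "",
])


def parse_with_encoded_rfc822(raw, policy=email.policy.default):
    """Parse raw text, handling encoded message/rfc822 parts by hand.

    Returns the outer message and a list of separately parsed inner messages.
    """
    # Relabel the part so the parser does not recurse into it.
    relabeled = re.sub(
        r"(?im)^Content-Type:[ \t]*message/rfc822",
        "Content-Type: " + PLACEHOLDER_TYPE,
        raw,
    )
    outer = email.message_from_string(relabeled, policy=policy)
    inner_messages = []
    for part in outer.walk():
        if part.get_content_type() == PLACEHOLDER_TYPE:
            # get_payload(decode=True) applies the part's
            # Content-Transfer-Encoding (quoted-printable here).
            decoded = part.get_payload(decode=True)
            inner_messages.append(email.message_from_bytes(decoded, policy=policy))
    return outer, inner_messages


outer, inners = parse_with_encoded_rfc822(MAIL)
for inner in inners:
    print(inner.get_content_type(), [p.get_content_type() for p in inner.iter_parts()])
```

The trade-off is that the outer tree no longer contains the inner message as a child; callers have to pair the relabeled part with its separately parsed message themselves.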

wbolster commented 2 years ago

it is expressly forbidden to use any non-trivial Content-Transfer-Encoding such as quoted-printable or base64 in message/rfc822 MIME parts; see RFC 2046 §5.2.1:

No encoding other than "7bit", "8bit", or "binary" is permitted for the body of a "message/rfc822" entity. The message header fields are always US-ASCII in any case, and data within the body can still be encoded, in which case the Content-Transfer-Encoding header field in the encapsulated message will reflect this. Non-US-ASCII text in the headers of an encapsulated message can be specified using the mechanisms described in RFC 2047.
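A sender-side or triage check for that rule could look like the sketch below. It assumes a non-strict parse (so nothing raises) and only inspects the headers of each `message/rfc822` part; `find_encoded_rfc822_parts` is a hypothetical helper and the sample message is made up for the example.

```python
import email
import email.policy

# Encodings the quoted rule permits for a message/rfc822 body.
ALLOWED_CTES = {"7bit", "8bit", "binary"}

# Made-up sample: a message/rfc822 part that declares quoted-printable.
SAMPLE = "\n".join([
    "MIME-Version: 1.0",
    "Content-Type: multipart/mixed; boundary=aaaaaa",
    "",
    "--aaaaaa",
    "Content-Type: message/rfc822",
    "Content-Transfer-Encoding: quoted-printable",
    "",
    "Subject: inner message",
    "",
    "body",
    "--aaaaaa--",
    "",
])


def find_encoded_rfc822_parts(msg):
    """Return the disallowed encodings declared on message/rfc822 parts."""
    offenders = []
    for part in msg.walk():
        if part.get_content_type() != "message/rfc822":
            continue
        # A missing header means the default encoding, 7bit.
        cte = str(part.get("Content-Transfer-Encoding", "7bit")).strip().lower()
        if cte not in ALLOWED_CTES:
            offenders.append(cte)
    return offenders


msg = email.message_from_string(SAMPLE, policy=email.policy.default)
print(find_encoded_rfc822_parts(msg))
```

A non-empty result means the message violates the rule quoted above, so anything parsed from inside those parts should be treated with suspicion.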