python / cpython

Email Header Folding Converts Non-CRLF Newlines to CRLFs #90620

Open 8a5fd93c-2f61-42bc-83cc-c28c8e7cd129 opened 2 years ago

8a5fd93c-2f61-42bc-83cc-c28c8e7cd129 commented 2 years ago
BPO 46462

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-bug', 'library', '3.11']
title = 'Email Header Folding Converts Non-CRLF Newlines to CRLFs'
updated_at =
user = 'https://bugs.python.org/jwalterclark'
```
bugs.python.org fields:
```python
activity =
actor = 'jwalterclark'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'jwalterclark'
dependencies = []
files = []
hgrepos = []
issue_num = 46462
keywords = []
message_count = 1.0
messages = ['411171']
nosy_count = 1.0
nosy_names = ['jwalterclark']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue46462'
versions = ['Python 3.11']
```

8a5fd93c-2f61-42bc-83cc-c28c8e7cd129 commented 2 years ago

In various places in the email library, str.splitlines is used to split header text at points where folding might have taken place in the original message source. This appears to be a bug because the split parts are then re-joined with a CRLF. https://github.com/python/cpython/blob/ef5bb25e2d6147cd44be9c9b166525fb30485be0/Lib/email/header.py#L369

str.splitlines splits on "universal newlines" which can include newlines other than the CRLF. https://docs.python.org/3/library/stdtypes.html#str.splitlines
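
As a minimal illustration (not taken from the email package itself), the following compares `str.splitlines` with an explicit split on CRLF for a value containing the FS control character (`\x1c`), one of the line boundaries listed in the `str.splitlines` documentation. The variable names are purely illustrative.

```python
# A header value containing FS (\x1c), which str.splitlines treats as a line boundary.
value = "BEFORE\x1cAFTER"

universal_parts = value.splitlines()  # splits on FS as well as CR, LF, CRLF, etc.
crlf_parts = value.split("\r\n")      # splits only on a literal CRLF

print(universal_parts)  # ['BEFORE', 'AFTER']
print(crlf_parts)       # ['BEFORE\x1cAFTER']
```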

However, the email RFCs define folding whitespace with CRLF as the only possible newline type (optionally surrounded by WSP (SP/HTAB) and/or comments). https://datatracker.ietf.org/doc/html/rfc5322#section-3.2.2
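
For comparison, here is a rough sketch of unfolding that treats CRLF followed by WSP as the only fold point, which is how RFC 5322 describes unfolding. The helper name `unfold_crlf_only` is hypothetical and is not part of the email package; comments inside folding whitespace are not handled.

```python
import re

# CRLF immediately followed by WSP (space or horizontal tab).
_FOLD = re.compile(r"\r\n(?=[ \t])")

def unfold_crlf_only(value: str) -> str:
    """Hypothetical helper: remove only those CRLFs that are followed by WSP."""
    return _FOLD.sub("", value)

folded = "BEFORE\x1cAFTER\r\n continued"
print(repr(unfold_crlf_only(folded)))  # 'BEFORE\x1cAFTER continued'
```

With this approach the FS character is left in place as ordinary header content rather than being treated as a line break.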

The end result is that a message making a roundtrip through the email parser/generator is mangled because any non-CRLF "universal newlines" it contains are converted to CRLFs. Anything in the header after the non-CRLF "universal newline" appears on its own line with no preceding whitespace. This appears to happen with all of the stock policies.

```python
from email import message_from_bytes
from email.policy import SMTPUTF8

eml_bytes = b'Header-With-FS-Char: BEFORE\x1cAFTER\r\n\r\nBody\r\n'
print(eml_bytes)

message = message_from_bytes(eml_bytes, policy=SMTPUTF8)
print(message.as_bytes(policy=SMTPUTF8))
```

```
b'Header-With-FS-Char: BEFORE\x1cAFTER\r\n\r\nBody\r\n'
b'Header-With-FS-Char: BEFORE\r\nAFTER\r\n\r\nBody\r\n'
```

The operational impact of this mangling is that the "AFTER" text now makes the message format invalid because it is neither a valid header (no ": ") nor the valid start of a message body (only one CRLF). Common MIME-viewers (e.g. Thunderbird/Outlook) appear to interpret it as a body anyway and any subsequent headers become part of the body.
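
One way to observe this with the standard library is to parse the mangled bytes again and inspect where "AFTER" ends up. This is only a sketch: it uses the default `compat32` policy for simplicity, and the exact defects reported may vary between Python versions.

```python
from email import message_from_bytes

# The mangled output reported above, fed back into the parser.
mangled = b'Header-With-FS-Char: BEFORE\r\nAFTER\r\n\r\nBody\r\n'

msg = message_from_bytes(mangled)  # default compat32 policy, chosen for illustration

print(repr(msg['Header-With-FS-Char']))  # header value now stops before "AFTER"
print(repr(msg.get_payload()))           # "AFTER" is expected to land in the body
print(msg.defects)                       # any defects the parser recorded
```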