jwhitlock commented 2 months ago

Bug report

Bug description:

I'm not sure if this is a bug, feature request, or user error. I'm happy to re-file once I know which.

If a parsed email header contains a correctly quoted newline, setting an email header to that value will include a newline.

from email import message_from_string
from email.policy import default

email_in = """\
To: incoming+tag@me.example.com
From: External Sender <sender@them.example.com>
Subject: Here's an =?UTF-8?Q?embedded_newline=0A?=
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

<html>
<head><title>An embeded newline</title></head>
<body>
  <p>I sent you an embedded newline in the subject. How do you like that?!</p>
</body>
</html>
"""

msg = message_from_string(email_in, policy=default)
for header, value in msg.items():
    del msg[header]
    msg[header] = value
email_out = str(msg)
print(email_out)

Output is:

To: incoming+tag@me.example.com
From: External Sender <sender@them.example.com>
Subject: Here's an embedded newline

Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

<html>
<head><title>An embeded newline</title></head>
<body>
  <p>I sent you an embedded newline in the subject. How do you like that?!</p>
</body>
</html>

An email parser will interpret the newline as the start of the message. In this case, the Content-Type and other MIME headers will not be processed, and the email will be treated as plain text. In other cases, required headers like To may not be processed and the email will not be delivered.
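To illustrate the effect on a receiving parser, here is a minimal sketch. The `broken` string is an abbreviated, hand-written stand-in for the serialized output above (not real output), and the variable names are only for illustration:

```python
from email import message_from_string
from email.policy import default

# Abbreviated, hand-written stand-in for the serialized message above:
# the blank line after Subject ends the header block early.
broken = (
    "To: incoming+tag@me.example.com\n"
    "Subject: Here's an embedded newline\n"
    "\n"
    'Content-Type: text/html; charset="UTF-8"\n'
    "MIME-Version: 1.0\n"
    "\n"
    "<html><body><p>Hello</p></body></html>\n"
)

parsed = message_from_string(broken, policy=default)

# Only the headers before the first blank line are recognized.
header_names = parsed.keys()
# With no Content-Type header, the message falls back to text/plain.
content_type = parsed.get_content_type()
# The "lost" headers are now part of the payload.
body = parsed.get_payload()

print(header_names)
print(content_type)
print(body)
```

The Content-Type and MIME-Version lines show up in the printed body rather than in the header list.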

I'd expect an error on setting the value, an error on serializing the EmailMessage to a string, the subject to retain the original encoding, or the newline to be quoted in the serialized version.

Now that we know the behavior, we can process the headers (embed or strip trailing newlines). However, you may see this as a bug, a needed feature, or missing documentation.
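A minimal sketch of that kind of workaround, assuming it is acceptable to collapse any CR/LF in a header value to a single space. `clean_header_value` is a hypothetical helper, not part of the email package, and the `raw` message is a shortened version of the example above:

```python
import re
from email import message_from_string
from email.policy import default

def clean_header_value(value) -> str:
    """Collapse CR/LF runs to one space and trim the ends (hypothetical helper)."""
    return re.sub(r"[\r\n]+", " ", str(value)).strip()

raw = (
    "To: incoming+tag@me.example.com\n"
    "Subject: Here's an =?UTF-8?Q?embedded_newline=0A?=\n"
    "Content-Type: text/html; charset=UTF-8\n"
    "MIME-Version: 1.0\n"
    "\n"
    "<p>body</p>\n"
)

msg = message_from_string(raw, policy=default)
for name, value in msg.items():
    del msg[name]
    # Assign a plain, cleaned string instead of the parsed header object.
    msg[name] = clean_header_value(value)

# Re-parse the serialized message to confirm the header block is intact.
reparsed = message_from_string(str(msg), policy=default)
print(reparsed.keys())
```

Like the original loop, this assumes each header name appears only once. Because `del msg[name]` removes every header with that name, repeated headers such as Received would need separate handling.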

More info:

The subject's type is email.headerregistry._UniqueUnstructuredHeader. It has a name attribute, so it is assigned without checking (see email.policy.EmailPolicy.header_store_parse()).

The _parse_tree, returned by email._header_value_parser.get_unstructured(), is:

UnstructuredTokenList([ValueTerminal("Here's"), WhiteSpaceTerminal(' '), ValueTerminal('an'), WhiteSpaceTerminal(' '), EncodedWord([ValueTerminal('embedded'), WhiteSpaceTerminal(' '), ValueTerminal('newline\n')])])

A user encountered this for our email relaying service https://relay.firefox.com (https://github.com/mozilla/fx-private-relay/issues/4841). An incoming email to a service address is matched to a user. We re-write the email headers and forward the email to the user's "real" address.

A real email has this subject header:

Subject: The All Over Piercings Wishlist of =?UTF-8?Q?John=2E=0A?=

This is from a European website https://www.alloverpiercings.com. You can create a wishlist and send it to an email address. The subject appears correctly encoded to me, to allow for non-ASCII usernames, with the unfortunate embedded newline. When forwarding this email, using something similar to the code above (but with more header modifications and additions), the embedded newline is turned into a real newline. The rest of the email headers are treated as part of the body. Since the Content-Type and other MIME headers are not processed as headers, the email is treated as a plain text email.

CPython versions tested on:

3.11, 3.12

Operating systems tested on:

macOS

Linked PRs

gh-121812
gh-122233
gh-122484
gh-122599
gh-122608
gh-122609
gh-122610
gh-122611

jwhitlock commented 2 months ago

On further investigation, a plain string with a trailing newline has this issue:

email["Subject"] = "string with newlines\n"

So the "re-use parsed header" is not part of the issue. The problem might be the newline detection in header_store_parse:

https://github.com/python/cpython/blob/dc03ce797ae8786a9711e6ee5dcaadde02c55864/Lib/email/policy.py#L131-L148

A single element list is returned by "string with newlines\n".splitlines(), so it can't detect a trailing newline.
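A short sketch of why a check of the form `len(value.splitlines()) > 1` misses this case. The direct membership test at the end is only one possible alternative, shown as an assumption rather than as the actual fix:

```python
value = "string with newlines\n"

# splitlines() drops a trailing line break instead of yielding an empty last item.
lines = value.splitlines()
print(lines)

# A check of the form len(value.splitlines()) > 1 only fires on interior breaks.
caught_by_splitlines = len(lines) > 1

# Hypothetical stricter check: look for the characters directly.
caught_by_membership = any(ch in value for ch in "\r\n")

print(caught_by_splitlines, caught_by_membership)
```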

ZeroIntensity commented 2 months ago

This is a bug (I was able to reproduce this on the CPython main branch), and looks like a minor security problem, considering this:

An email parser will interpret the newline as the start of the message.

For example, I could see someone developing an app that does something like this:

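from email.message import EmailMessage

def email_notification(name: str):
    msg = EmailMessage()
    msg.set_content("This is an automatic notification blah blah blah...")
    # name is user-controlled and goes into the header unchecked
    msg["Subject"] = (
        f"{name} sent you a message!"
    )
    # smtp_server is assumed to be an already-connected smtplib.SMTP instance
    smtp_server.send_message(msg)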

If a user set their name to something like "=?UTF-8?Q?=0A?==?UTF-8?Q?=0A?=This comes before the actual body!", then "This comes before the actual body!" would precede the rest of the message. (FWIW, I'm not a security researcher nor a cybersecurity expert, so this is speculative.)

Furthermore, you could use this to inject extra message headers.

basbloemsaat commented 2 months ago

It seems to be a bug, or two even.

import email.message
from email.policy import default

msg = email.message.EmailMessage(policy=default)
msg['Subject'] = 'A 💩 subject\nBcc: injected@example.com'
print(str(msg))

The above throws a ValueError("Header values may not contain linefeed or carriage return characters"), as expected.

However, the following does not, and it inserts an extra newline, thus invalidating some headers:

msg = email.message.EmailMessage(policy=default)
msg['Subject'] = 'A 💩 subject\n'
msg.set_content('This is 💩 the body of the message.\n')
print(str(msg))

And by using a UTF-8 encoded-word newline, it even inserts an extra header:

msg = email.message.EmailMessage(policy=default)
msg['Subject'] = 'A 💩 subject=?UTF-8?Q?=0A?=Bcc: injected@example.com'
msg.set_content('This is 💩 the body of the message.\n')
print(str(msg))


So, I think two things have to be solved:

Newlines at the end should either throw a ValueError, like newlines in the middle do, or be stripped, as they are not allowed by the RFC.
Encoded newlines should also throw a ValueError.
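As a rough sketch of what such a check could look like in application code (this is not the fix that landed in CPython; `contains_line_break` and `check_header_value` are hypothetical helpers built on `email.header.decode_header`):

```python
from email.header import decode_header

def contains_line_break(value: str) -> bool:
    """Return True if value has a raw or RFC 2047-encoded CR/LF (hypothetical helper)."""
    # Raw line breaks, including a trailing one.
    if "\r" in value or "\n" in value:
        return True
    # Line breaks hidden inside encoded-words such as =?UTF-8?Q?=0A?=.
    for part, _charset in decode_header(value):
        if isinstance(part, str):
            part = part.encode("utf-8", "replace")
        if b"\r" in part or b"\n" in part:
            return True
    return False

def check_header_value(value: str) -> str:
    """Raise ValueError instead of storing a value that could split the header block."""
    if contains_line_break(value):
        raise ValueError(
            "Header values may not contain linefeed or carriage return characters"
        )
    return value

print(contains_line_break("A subject\n"))
print(contains_line_break("Here's an =?UTF-8?Q?embedded_newline=0A?="))
print(contains_line_break("A plain subject"))
```

The byte-level comparison is an approximation that works for ASCII-compatible charsets such as UTF-8. A stricter version would decode each part with its declared charset before checking.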

@encukou : I'll try to fix both during (or after) the EuroPython sprint, ok?

jwhitlock commented 1 month ago

Thanks @basbloemsaat. Feel free to pick a better title for this issue (or suggest one if I need to change it), or re-file for the individual issues.

ZeroIntensity commented 1 month ago

I'm pretty sure this is a security problem, as you can inject extra headers. @Eclips4 what do you think, and could you add the security label?

Eclips4 commented 1 month ago

I would like to hear @serhiy-storchaka's opinion on this.

medmunds commented 1 month ago

#121284 turns out to be a variation of this, where refolding a parsed RFC 2047 encoded-word can leak 'specials' characters into structured headers without proper quoting/encoding. The security issue is not quite as severe as letting newlines leak in, but unquoted specials can allow manipulation of the message sender and recipients.

encukou commented 3 days ago

Thank you @jwhitlock for the report, and @basbloemsaat for the initial fix!

python / cpython

email.policy.default - gotcha with re-using parsed headers with embedded newlines #121650