python / cpython

Parsing adds whitespace to the front of long headers #124452

Open calpaterson opened 1 month ago

calpaterson commented 1 month ago

Bug report

Bug description:

When a serialized email is parsed back, whitespace is prepended to a header's value if that header was wrapped (folded) when it was written.

This is particularly noticeable for Message-IDs, which no longer match after a round trip: a newline is prepended under the compat32 policy and a space under the default policy.

import string

from email import message_from_bytes
from email.message import EmailMessage
import email.policy

orig = EmailMessage()
# 78 characters with no whitespace, so the header cannot fit on one 78-character line.
orig["Message-ID"] = string.ascii_lowercase * 3
policy = email.policy.default  # changing to compat32 emits a different error
# Round trip: serialize to bytes, then parse those bytes back.
parsed = message_from_bytes(orig.as_bytes(policy=policy), policy=policy)
assert (
    parsed["Message-ID"] == orig["Message-ID"]
), f"message ids don't match: '{orig['Message-ID']}' != '{parsed['Message-ID']}'"

I'm not very familiar with RFC 2822, but based on its rules for "long" header fields, the written email bytes look right to me; the problem only appears when they are read back.
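For reference, the "long header fields" rule in RFC 2822 (section 2.2.3, carried over into RFC 5322) says that unfolding removes any line break that is immediately followed by a space or tab. Below is a minimal sketch of that rule applied to a hand-written folded header, not to real serializer output. The `unfold` helper and the sample bytes are illustrative assumptions and are not part of the `email` package.

```python
import re


def unfold(raw: bytes) -> bytes:
    """Remove every line break that is immediately followed by a space or tab."""
    return re.sub(rb"\r?\n(?=[ \t])", b"", raw)


# Hypothetical folded header, shaped like the wrapped output described above.
folded = b"Message-ID:\r\n abcdefghijklmnopqrstuvwxyz\r\n"
unfolded = unfold(folded)
print(unfolded)
```

After unfolding, the only whitespace between the colon and the value is the single space that began the continuation line, which is consistent with the written bytes being valid.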

CPython versions tested on:

3.9, 3.12

Operating systems tested on:

Linux

rruuaanng commented 1 month ago

You may be referring to RFC 822. Either way, the behavior above is indeed wrong. As a workaround, you could call .strip(' ') on the value after parsing.
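A sketch of that workaround, with one adjustment: since the report says compat32 prepends a newline and not a space, this version strips line breaks and tabs as well. The helper name `header_without_leading_whitespace` is hypothetical, and the sketch assumes the caller only needs the value as a plain string.

```python
import string

from email import message_from_bytes
from email.message import EmailMessage
import email.policy


def header_without_leading_whitespace(msg, name):
    """Return the header value as a str with leading whitespace removed."""
    value = msg[name]
    if value is None:  # header not present
        return None
    return str(value).lstrip(" \t\r\n")


policy = email.policy.default
orig = EmailMessage()
orig["Message-ID"] = string.ascii_lowercase * 3
parsed = message_from_bytes(orig.as_bytes(policy=policy), policy=policy)

message_id = header_without_leading_whitespace(parsed, "Message-ID")
print(message_id == orig["Message-ID"])
```

On an interpreter where the parser already strips this whitespace, the helper is a no-op, so it should be safe to leave in place.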

bitdancer commented 1 week ago

Well, there are two problems here. One is the wrapping on serialization. The original design was that when a word was too long to fit within the maximum line length, encoded words would be used to do the wrapping. I'm not sure whether that was a good choice, or why it isn't happening here, unless someone "fixed" that design. So there are two choices: either make it so that a longer-than-maximum word doesn't cause wrapping, or fix it so that encoded words are used and the line is wrapped to fit correctly within the maximum line length.

However, that is not the bug you are addressing here, so it should go into another issue if you want to open one. (You could also just ignore it).
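In the meantime, a user-side way to sidestep the wrapping is to serialize with line wrapping turned off. This is a sketch of a workaround, not an implementation of either design choice above. It relies on the documented `max_line_length` policy attribute, where 0 or None disables wrapping; the name `no_wrap` is just illustrative.

```python
import string

from email import message_from_bytes
from email.message import EmailMessage
import email.policy

# Clone the default policy with wrapping disabled (0 or None means "never wrap").
no_wrap = email.policy.default.clone(max_line_length=0)

orig = EmailMessage()
orig["Message-ID"] = string.ascii_lowercase * 3
raw = orig.as_bytes(policy=no_wrap)
parsed = message_from_bytes(raw, policy=no_wrap)

# With no folding on output, there is no continuation line to mis-parse.
round_trips = parsed["Message-ID"] == orig["Message-ID"]
print(round_trips)
```

The trade-off is that nothing is wrapped at all, so very long headers are written on a single line; that may matter for messages handed to systems that enforce line-length limits.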

Then, there is the parsing problem. That leading space on the next line is supposed to be treated as if it were the space between the ':' and the body of the header. As I noted on the PR review, the problem is that I failed to include newline and carriage return as part of the whitespace to be stripped from the start of the value.
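A minimal sketch of the stripping described above, written as a stand-alone function. The name `parse_header_source` is hypothetical, and this illustrates the idea; it is not the actual patch.

```python
def parse_header_source(sourcelines):
    """Split raw header lines into (name, value), dropping leading whitespace.

    `sourcelines` holds the first line of a header followed by any
    continuation lines, each still carrying its line ending.
    """
    name, value = sourcelines[0].split(":", 1)
    # Join first, then strip: the value of a folded header can start with a
    # line break, so "\r" and "\n" are stripped along with spaces and tabs.
    value = "".join((value, *sourcelines[1:])).lstrip(" \t\r\n")
    return name, value.rstrip("\r\n")


print(parse_header_source(["Message-ID:\n", " abcdef\n"]))
```

Only the start of the value is stripped. Line breaks between later continuation lines are left for the normal unfolding step to handle.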

bitdancer commented 1 week ago

Now I remember. There was a previous bug where long message ids were being encoded using encoded words, which is not legal per the RFC. We fixed that bug, but didn't deal with the long-word-gets-moved-to-next-line bug at that time.