Open medmunds opened 4 months ago
Really interesting issue Mike. Thanks to you and Andre for the example - really cunning bit of Base 64 there :).
Perhaps I'm being naive, but it might save time or at least give a work around to try to solve this at a high level first. So may I ask if you're sure the policies are consistent? email.policy.default
is actually email.policy.EmailPolicy()
, i.e. not compat32
https://github.com/python/cpython/blob/main/Lib/email/policy.py#L226
I don't know if the bottom line of the following is a valid RFC 2047 header. But there's no unquoted comma in it, and the base 64 is unchanged (not decoded):
import email.message
import email.policy
message = email.message.Message() # (with default policy=compat32)
message["To"] = s = '=?utf-8?b?TmfGsOG7nWkgbmjhuq1uIGEgdmVyeSB2ZXJ5IGxvbmcs?= name <to@example.com>'
assert message.as_bytes(policy=email.policy.default) == b'To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1u?= a very very long, name <to@example.com>\n\n'
# explictly use same policy
assert message.as_bytes(policy=email.policy.compat32) == b'To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1uIGEgdmVyeSB2ZXJ5IGxvbmcs?= name\n <to@example.com>\n\n'
Nonetheless, the default policy to email.message.Message.__init__
, is not email.policy.default
. If it's not possible to make the two consistent the obvious way without a breaking change to the API, then that difference should be made clear in the docs if it isn't already.
Also consistently specifying email.policy.default
reproduces the issue too, in Message and EmailMessage, and in Message.as_string, so there is room for improvement:
m2 = email.message.Message(policy=email.policy.default)
m2["To"] = s
assert m2.as_bytes(policy=email.policy.default) == b'To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1u?= a very very long, name <to@example.com>\n\n'
em = email.message.EmailMessage(policy=email.policy.default)
em["To"] = s
assert em.as_bytes(policy=email.policy.default) == b'To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1u?= a very very long, name <to@example.com>\n\n'
I'm sure there's code there that parses and decodes the To: header correctly, as curiously the string representation of the UniqueAddressHeader
from header_store_parse looks fine, to me at least (it has the errant comma in the very very long name quoted). I'm wondering if either the correct code isn't being called correctly in the first place, or if the call to fold is parsing it again when it ought not to be. But something else probably relies on that, e.g. that needs the raw input.
import email.headerregistry
HR = email.headerregistry.HeaderRegistry
To = HR()['to']
assert str(To('to', s)) == '"Người nhận a very very long, name" <to@example.com>'
Thanks for the analysis, James.
So may I ask if you're sure the policies are consistent? … the default policy to email.message.Message.__init__, is not email.policy.default.
Sorry, yes, I used "default" confusingly in a code comment. I was trying to convey that the policy of an email.message.Message defaults to email.policy.compat32. (I've edited my original report to clarify.)
And I originally thought the root of this problem was using compat32 to build the message, then email.policy.default in the generator.
But your analysis seems to show that compat32 doesn't need to be involved, and this is actually a bug with email.policy.default header refolding in the generator. Here's an example using another problematic encoded-word that Andrés identified and sticking entirely to email.policy.default:
em2 = email.message.EmailMessage() # (uses policy=email.policy.default)
em2["To"] = '=?utf-8?q?A_v=C3=A9ry_long_name_with_non-ASCII_char_and=2C_comma?= <to@example.com>'
em2["To"]
# '"A véry long name with non-ASCII char and, comma" <to@example.com>'
em2.as_string() # (uses em2.policy, which is email.policy.default)
# 'To: A =?utf-8?q?v=C3=A9ry?= long name with non-ASCII char and, comma\n <to@example.com>\n\n'
The unquoted comma is still invalid. (Pretty sure the wrapping between display-name and addr-spec is allowed, though it might be discouraged in a "should not" sense.)
The bug shows up whether the message is constructed or parsed:
from email import message_from_string
em3 = message_from_string(
"To: =?utf-8?q?A_v=C3=A9ry_long_name_with_non-ASCII_char_and=2C_comma?= <to@example.com>\r\n\r\n",
policy=email.policy.default,
)
em3.as_string()
# 'To: A =?utf-8?q?v=C3=A9ry?= long name with non-ASCII char and, comma\n <to@example.com>\n\n'
But if the header is already wrapped before parsing (so that it doesn't require refolding in the generator), the problem goes away:
em4 = message_from_string(
"To: =?utf-8?q?A_v=C3=A9ry_long_name_with_non-ASCII_char_and=2C_comma?="
+ "\r\n <to@example.com>\r\n\r\n", # note header wrapping
policy=email.policy.default,
)
em4.as_string()
# 'To: =?utf-8?q?A_v=C3=A9ry_long_name_with_non-ASCII_char_and=2C_comma?=\n <to@example.com>\n\n'
I don't know if the bottom line of the following is a valid RFC 2047 header…:
b'To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1uIGEgdmVyeSB2ZXJ5IGxvbmcs?= name\n <to@example.com>\n\n'
I believe that is valid. Pretty sure you can freely mix encoded-words, atoms, and quoted-strings. (Whether all email clients handle it properly is a different question.)
As far as a high-level workaround, the original report is actually a case isolated from Django, which has a lot of code built on the legacy email API combined with its own workarounds for historic issues in that API. Some of those workarounds tend to generate these problematic (but technically valid) RFC 2047 encoded-word headers. The best workaround seems to be not pre-encoding the headers, and then sticking entirely to the modern policy=default API (to avoid other legacy issues that necessitated pre-encoding). I'm working on updating Django for that.
But bottom line, it seems like there is a bug here in email.policy.default header refolding.
Also, this seems to be a rare case where compat32 works better than email.policy.default:
import email.message
import email.policy
import email.utils
display_name = "A véry long name with non-ASCII char and, comma"
to = email.utils.formataddr((display_name, "to@example.com"))
# '=?utf-8?q?A_v=C3=A9ry_long_name_with_non-ASCII_char_and=2C_comma?= <to@example.com>'
msg = email.message.EmailMessage(policy=email.policy.default)
msg["To"] = to
msg.as_string(policy=email.policy.default) # invalid header:
# 'To: A =?utf-8?q?v=C3=A9ry?= long name with non-ASCII char and, comma\n <to@example.com>\n\n'
msg.as_string(policy=email.policy.compat32) # valid header:
# 'To: =?utf-8?q?A_v=C3=A9ry_long_name_with_non-ASCII_char_and=2C_comma?=\n <to@example.com>\n\n'
And I'm now convinced this is the same underlying problem as #80222 (which focuses on email addresses in the display-name, and has an abandoned PR).
Thanks Mike. I agree this is definitely a real bug, and it potentially affects even more code (not just bytes) than your example - thanks for the report.
Unfortunately from my dip in to the code base briefly today, I think this is better addressed by someone with more knowledge than me, of all the things the email library has already fixed. Considerable effort has gone into it beforehand, and it does a lot of things very well already for a lot of people. I don't want to break any of that.
While this is related to #80222, it turns out to be a different issue. But they both have the same security implications.
The problem here is that _refold_parse_tree() (in email._header_value_parser) can remove RFC 2047 encoding from parts of a parsed encoded-word as it refolds a header. If that encoded-word was in a structured header (like an address field), removing the encoding can result in leaking 'specials' (like commas) into the header without proper quoting.
Since _refold_parse_tree() doesn't know whether it's working in a structured or unstructured header, the easiest fix seems to be retaining any existing RFC 2047 encoding on text that contains a 'specials' character. I've opened PR #122754 with that change.
[Incidentally, I reported this to the PSRT about a month ago, together with pointing out #80222 was still around. They agreed both problems seem to be security issues, and instructed me to handle the fixes publicly. I'm guessing since #80222 has already been public for five years, there's not much point in keeping the discussion private. Someone may want to add the #topic-email and #type-security labels to both issues.]
Bug report
Bug description:
If a (legacy email API) email.message.Message is assigned an address header that is pre-encoded with RFC 2047, calling as_bytes(policy=default) can generate an invalid address header. [edit: also affects modern email.message.EmailMessage—see comment]
Here is a minimal example to demonstrate the problem, isolated from much larger code (including Django's django.core.mail.message):
(The unquoted comma in the resulting display-name is not valid.)
For a real-world case where this can occur, see Django ticket 35378 and anymail/django-anymail#369. (Thanks to @andresmrm for noticing the problem and isolating a test case.)
Oddly, as_string(policy=default) doesn't exhibit the problem:
Also, the problem does not occur when assigning the non-encoded equivalent to the header:
(Possibly related to #80222)
CPython versions tested on:
3.9, 3.12
Operating systems tested on:
macOS
[edits: removed ambiguous use of "default" in example comment; clarified this is not a real-world example, but a minimal test case]
Linked PRs