python / cpython

email as_bytes() corruption of headers with SMTPUTF8 policy #99927

Open mricon opened 1 year ago

mricon commented 1 year ago

Bug report

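Parsing an RFC 2822 message and then serializing it again with the SMTPUTF8 policy corrupts headers. In the example below, the non-ASCII display name that lands on the folded continuation line of the To header comes back as an unknown-8bit encoded-word, while the From and Subject headers are left intact.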

import email
import email.policy

odata = '''From: Unicôde Nâme <example@example.com>
To: Name One <name-one@example.com>, Name Two <name-two@example.com>, Unicôde Nàme <unicode-name@example.com>
Subject: Unicôde Subjéct

This message contains Uniçôde Çôntent.
'''

msg = email.message_from_bytes(odata.encode(), policy=email.policy.SMTPUTF8)

ndata = msg.as_bytes(policy=email.policy.SMTPUTF8)
print(ndata.decode())

Expected output:

From: Unicôde Nâme <example@example.com>
To: Name One <name-one@example.com>, Name Two <name-two@example.com>,
 Unicôde Nàme <unicode-name@example.com>
Subject: Unicôde Subjéct

This message contains Uniçôde Çôntent.

Actual output:

From: Unicôde Nâme <example@example.com>
To: Name One <name-one@example.com>, Name Two <name-two@example.com>,
 =?unknown-8bit?q?Unic=C3=B4de_N=C3=A0me?= <unicode-name@example.com>
Subject: Unicôde Subjéct

This message contains Uniçôde Çôntent.
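
For anyone checking their own output for this symptom, here is a minimal sketch of how the affected value could be detected and turned back into readable text. It is not a fix for the serializer. The helper names `has_unknown_8bit` and `recover_utf8` are made up for illustration, and the sketch assumes the bytes behind an `unknown-8bit` encoded-word are really UTF-8, which holds for the message above but not in general.

```python
from email.header import decode_header

# Prefix of the encoded-word seen in the "Actual output" above.
MARKER = "=?unknown-8bit?"


def has_unknown_8bit(header_text):
    """Return True if the text contains an unknown-8bit encoded-word."""
    return MARKER in header_text.lower()


def recover_utf8(header_text):
    """Decode encoded-words, treating unknown-8bit payloads as UTF-8.

    Assumption: the original message was UTF-8, as in the report.
    """
    parts = []
    for chunk, charset in decode_header(header_text):
        if isinstance(chunk, str):
            # No encoded-words at all: decode_header returns the str unchanged.
            parts.append(chunk)
        elif charset is None:
            # Plain text next to an encoded-word comes back as
            # raw-unicode-escape bytes.
            parts.append(chunk.decode("raw-unicode-escape"))
        elif charset == "unknown-8bit":
            parts.append(chunk.decode("utf-8", errors="replace"))
        else:
            parts.append(chunk.decode(charset, errors="replace"))
    return "".join(parts)


folded = "=?unknown-8bit?q?Unic=C3=B4de_N=C3=A0me?= <unicode-name@example.com>"
print(has_unknown_8bit(folded))
print(recover_utf8(folded))
```

This only post-processes already-corrupted output; the underlying folding behaviour in as_bytes() is what the report is about.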

bmorg commented 2 months ago

I think I am running into the same problem using the mailbox module, which internally uses email.message_from_binary_file.

To reproduce:

Sample email file, UTF-8 encoded:

Date: Fri, 28 Jun 2024 15:31:43 +0200
MIME-Version: 1.0
To: test@ümlaut.example
Subject: Test
Content-Type: text/plain; charset=UTF-8; format=flowed

Message

Code:

>>> import email
>>> f = open('umlaut-email', 'rb')
>>> msg = email.message_from_binary_file(f)
>>> msg['To'].encode()
'=?unknown-8bit?q?test=40=C3=BCmlaut=2Eexample?='

Edit: This is Python 3.12.2 on macOS.

bmorg commented 2 months ago

Update: I seem to have misunderstood the policy handling. I thought it would be detected automatically. When specifying it manually, I get the expected result:

>>> msg = email.message_from_binary_file(f, policy=email.policy.SMTPUTF8)
>>> msg['To'].encode()
b'test@\xc3\xbcmlaut.example'

This makes me wonder whether the mailbox module can handle SMTPUTF8 at all, since it never seems to specify a policy.
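
One possible way around this, sketched under assumptions rather than confirmed as the intended approach: the mailbox classes accept a `factory` callable that receives a binary file-like object for each message, so a factory can pass the policy explicitly. The names `smtputf8_factory`, `read_to_headers` and `demo`, and the `From ` separator line in the sample data, are invented for this illustration.

```python
import email
import email.policy
import mailbox
import os
import tempfile


def smtputf8_factory(fp):
    """Parse one stored message with the SMTPUTF8 policy.

    mailbox hands the factory a binary file-like object.
    """
    return email.message_from_binary_file(fp, policy=email.policy.SMTPUTF8)


def read_to_headers(mbox_path):
    """Return the To header of every message in an mbox file as str."""
    box = mailbox.mbox(mbox_path, factory=smtputf8_factory, create=False)
    try:
        return [str(msg["To"]) for msg in box]
    finally:
        box.close()


def demo():
    """Write the sample message to a temporary mbox and read it back."""
    raw = (
        "From sender@example.com Fri Jun 28 15:31:43 2024\n"  # hypothetical mbox separator
        "Date: Fri, 28 Jun 2024 15:31:43 +0200\n"
        "MIME-Version: 1.0\n"
        "To: test@ümlaut.example\n"
        "Subject: Test\n"
        "Content-Type: text/plain; charset=UTF-8; format=flowed\n"
        "\n"
        "Message\n"
    ).encode("utf-8")
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "umlaut.mbox")
        with open(path, "wb") as f:
            f.write(raw)
        return read_to_headers(path)


print(demo())
```

With a factory, the mailbox returns whatever the factory produces instead of its own message class (mboxMessage for mbox), so format-specific methods such as flags are not available on the result.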