python / cpython

email module get_content() yields invalid UTF8 when CTE is 8bit #105285

Open dougmccasland opened 1 year ago

dougmccasland commented 1 year ago

Python 3.10.6 module email — An email and MIME handling package v3.11.3

Consider this simple message:

Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8-bit
MIME-Version: 1.0
From: Dmcc <foobar1@gmail.com>
To: Dmcc <foobar2@gmail.com>
Subject: test msg of 8bit CTE and UTF8

there is the hötel 

Notice the o-umlaut in the word hotel, this is encoded in utf8. I put this in a file called msg.eml. Then run this:

#!/usr/bin/env python3

import email
from email.policy import default

# Text mode: the parser receives str, not bytes.
with open("msg.eml", "r") as f:
    msg = email.message_from_file(f, policy=default)
print('CTE: ', msg['content-transfer-encoding'])
body = msg.get_content()
print('body:', body)

The output:

CTE:  8-bit
body: there is the h�tel

I expect the output to be valid UTF-8 since the CTE is 8bit. This problem also happens with the older get_payload() and with any of the "_from" methods, such as email.message_from_bytes().

michaelfm1211 commented 1 year ago

I tested this on the current main branch and got the same thing. However, I did not get an issue when using msg.get_payload().

The .get_content() method is determined by msg's policy's content manager. For the default policy, that's email.content_manager.raw_data_manager. For MIME types starting with text, raw_data_manager uses email.content_manager.get_text_content. get_text_content() calls msg.get_payload(decode=True) (which decodes the payload according to the CTE header), then decodes the result with the email's charset. Here's the source for get_text_content() (link):

def get_text_content(msg, errors='replace'):
    content = msg.get_payload(decode=True)
    charset = msg.get_param('charset', 'ASCII')
    return content.decode(charset, errors=errors)
raw_data_manager.add_get_handler('text', get_text_content)

I don't know too much about Unicode, but I think the issue is that if decode=True and the payload is a string (which it is in your case), then msg.get_payload() will first try to encode the payload with 'ascii', then fall back to raw-unicode-escape. Of course, the umlaut cannot be encoded with ASCII, so msg.get_payload() returns the payload encoded with raw-unicode-escape. Here's the source code for where that happens (link):

try:
    bpayload = payload.encode('ascii')
except UnicodeError:
    # This won't happen for RFC compliant messages (messages
    # containing only ASCII code points in the unicode input).
    # If it does happen, turn the string into bytes in a way
    # guaranteed not to fail.
    bpayload = payload.encode('raw-unicode-escape')

So msg.get_payload(decode=True) returns bytes produced by 'raw-unicode-escape', in which the ö becomes the single byte 0xf6. get_text_content() then decodes those bytes with the message's charset (UTF-8 in your case). 0xf6 is not a valid UTF-8 start byte, and because errors='replace' is in effect, the decoder substitutes the replacement character (those weird symbols) instead of raising a UnicodeDecodeError.
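The mismatch described above can be reproduced without the email package at all. The snippet below is only an illustrative sketch using plain str and bytes methods; the sample text is taken from the message in this issue.

```python
# Encode the way get_payload(decode=True) falls back to for a non-ASCII str payload.
text = "there is the hötel"
raw = text.encode("raw-unicode-escape")
print(raw)  # b'there is the h\xf6tel'

# Decode the way get_text_content() does when the charset is UTF-8.
decoded = raw.decode("utf-8", errors="replace")
print(repr(decoded))  # 'there is the h\ufffdtel'
```

U+FFFD is the replacement character that shows up as � in the output.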

To fix this, I think msg.get_payload(decode=True) should first try encoding with UTF-8 if the CTE is 8bit, and with ASCII for any other CTE. If that fails, it can fall back to raw-unicode-escape.
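A minimal sketch of that fallback order, written as a standalone function rather than as a patch to the email package. The name encode_str_payload and its signature are hypothetical, and this is not the code from any actual PR; it only illustrates the ordering proposed above.

```python
def encode_str_payload(payload: str, cte: str) -> bytes:
    """Hypothetical helper: turn a str payload into bytes before CTE decoding."""
    # Proposed order: UTF-8 for 8bit, ASCII for every other CTE.
    codec = "utf-8" if cte.strip().lower() == "8bit" else "ascii"
    try:
        return payload.encode(codec)
    except UnicodeError:
        # Last resort, same as the existing code: an encoding that cannot fail
        # for this kind of input.
        return payload.encode("raw-unicode-escape")


print(encode_str_payload("hötel", "8bit"))  # b'h\xc3\xb6tel'
print(encode_str_payload("hötel", "7bit"))  # b'h\xf6tel'
```

With 8bit the bytes are valid UTF-8, so a later decode with charset UTF-8 round-trips; with any other CTE the behavior is unchanged.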

dougmccasland commented 1 year ago

Very interesting.

Quite right, get_payload() (with no args) with the above input works. With get_payload(), I had previously thought of using the errors='replace' kw arg because some incoming messages had (apparently) malformed utf8 causing the python script to terminate early.

I forget the sequence, but I later came across some input that contained a 3-byte UTF-8 character ’ (Right Single Quotation Mark, U+2019); the sender's software used it in the word "I'm" instead of a simple ASCII apostrophe. With errors='replace', I was now seeing \u2019 in the output (see below). I suppose that is raw-unicode-escape? (The umlaut is 2 bytes.) I thought I could prevent such problems by switching to the newer get_content(), and the code is simpler. Plus the documentation says:

https://docs.python.org/3/library/email.compat32-message.html?highlight=get_payload#email.message.Message.get_payload
"This is a legacy method. On the EmailMessage class its functionality is replaced by get_content() and iter_parts()."

When I added that \u2019 character to the input, here is what I got, with various methods:

get_payload() : here is the hötel where I’m staying  

get_payload(decode=True).decode() : UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 13: invalid start byte

get_payload(decode=True) : b'here is the h\xf6tel where I\\u2019m staying  \n'

get_payload(decode=True).decode(errors='replace') : here is the h�tel where I\u2019m staying  

get_content() : here is the h�tel where I\u2019m staying  

So I will try reverting to plain get_payload() and see how that works with future real-world email messages. Of course, there are countless ways an email message can be badly encoded by the MUA or by text that is pasted into the MUA from another program. (I've sent my share of malformed messages. :-) )

Also interesting: https://docs.python.org/3/whatsnew/3.2.html?highlight=get_payload
"Given bytes input to the model, get_payload() will by default decode a message body that has a Content-Transfer-Encoding of 8bit using the charset specified in the MIME headers and return the resulting string."

I like your idea about how to fix it, although I don't know how to do it.

michaelfm1211 commented 1 year ago

Yep, it looks like that odd behavior with the right single quotation mark character is being caused by encoding with raw-unicode-escape. Just to confirm I ran it through .encode() manually and got the same result as you:

>>> 'here is the hötel where I’m staying  \n'.encode('raw-unicode-escape')
b'here is the h\xf6tel where I\\u2019m staying  \n'

Regarding the documentation's note on get_payload() being a legacy method: get_content() just calls get_payload() under the hood when using the raw_data_manager (which is used by the default policy), so I guess the "legacy method" note is only for end users. get_payload() is still very much involved in the process.

Regarding the description of get_payload() in What's New in Python 3.2 that you linked: I find it odd that the docs say it will decode a message with a CTE of 8bit, while the code only attempts the ASCII codec (which should only be used for 7bit and other CTEs that only use ASCII, such as base64) before falling back to raw-unicode-escape. I implemented my fix idea in this PR, and it handles the right single quotation mark character as expected:

msg.get_payload(): here is the hötel where I’m staying

msg.get_payload(decode=True).decode(): here is the hötel where I’m staying

msg.get_payload(decode=True): b'here is the h\xc3\xb6tel where I\xe2\x80\x99m staying\n'

msg.get_payload(decode=True).decode(errors='replace'): here is the hötel where I’m staying

msg.get_content(): here is the hötel where I’m staying

dougmccasland commented 1 year ago

Thanks Michael.

Adding some more: When I use get_payload() (no args) and the message has CTE quoted-printable, the payload is not QP-decoded; but it's correctly QP-decoded with get_content():

Input:

Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Subject: test msg qp
To: Dmcc <foobar1@gmail.com>
From: Dmcc <foobar2@gmail.com>

here is the h=C3=B6tel where I=E2=80=99m staying=20=20

Results:

get_payload() : here is the h=C3=B6tel where I=E2=80=99m staying=20=20

get_payload(decode=True) : b'here is the h\xc3\xb6tel where I\xe2\x80\x99m staying  \n'

get_payload(decode=True).decode(errors='replace') : here is the hötel where I’m staying  

get_content() : here is the hötel where I’m staying  

So as a work-around, I am using get_payload() (no args) for CTE 8bit, and get_content() (no args) for CTE QP (or anything besides 8bit). Tested with CTE base64 and get_content() works for that.
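The work-around above could be wrapped in a small helper along these lines. This is only a sketch: body_text is a hypothetical name, the sample messages are made up for illustration, and accepting the "8-bit" spelling is an assumption based on the header value in the first message of this thread.

```python
import email
from email.policy import default


def body_text(msg):
    """Hypothetical helper: pick the accessor based on the CTE header."""
    cte = str(msg.get("content-transfer-encoding", "7bit")).strip().lower()
    if cte in ("8bit", "8-bit"):
        # A str payload parsed from text is returned as-is, no re-encoding.
        return msg.get_payload()
    # quoted-printable, base64, 7bit: let the content manager decode.
    return msg.get_content()


eight_bit = email.message_from_string(
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "hötel piñata\n",
    policy=default,
)
quoted = email.message_from_string(
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "h=C3=B6tel pi=C3=B1ata\n",
    policy=default,
)
print(repr(body_text(eight_bit)))
print(repr(body_text(quoted)))
```

The helper only covers single-part text messages; multipart messages would need to be walked part by part.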

--Doug

rapidcow commented 3 months ago

I played around with this for a bit, and it seems that this only happens when the _payload attribute is set to the Unicode string itself rather than the usual surrogate-escaped UTF-8-encoded bytes.
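For readers unfamiliar with the term, the "surrogate-escaped" form can be reproduced with plain codecs. The snippet is an illustrative sketch of the representation, not code taken from the email package.

```python
original = "hötel piñata\n"
utf8_bytes = original.encode("utf-8")

# Decoding as ASCII with surrogateescape maps each non-ASCII byte to a lone
# surrogate in the range U+DC80..U+DCFF instead of failing.
escaped = utf8_bytes.decode("ascii", "surrogateescape")
print(repr(escaped))  # 'h\udcc3\udcb6tel pi\udcc3\udcb1ata\n'

# Reversing the step recovers the exact bytes, which then decode as UTF-8.
restored = escaped.encode("ascii", "surrogateescape").decode("utf-8")
print(restored == original)  # True
```

Because the original bytes survive this round trip, a payload stored this way can still be decoded with the charset from the headers.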

With the string parser, the issue is exactly as you described:

import email, email.policy
msg_from_str = email.message_from_string("""\
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

hötel piñata
""", policy=email.policy.default)
print((msg_from_str._payload, msg_from_str.get_content()))
# ==> ('hötel piñata\n', 'h�tel pi�ata\n')

With every other method I can think of, _payload is always the surrogate-escaped string:

import email, email.policy, email.message, email.charset
msg_from_bytes = email.message_from_bytes("""\
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

hötel piñata
""".encode('utf-8'), policy=email.policy.default)
print((msg_from_bytes._payload, msg_from_bytes.get_content()))
# ==> ('h\udcc3\udcb6tel pi\udcc3\udcb1ata\n', 'hötel piñata\n')

# set body_encoding to something other than QP and BASE64
# so we can use 7bit/8bit CTE?
# i don't know, legacy interface is quirky....
UTF8_RAW = email.charset.Charset('utf-8')
UTF8_RAW.body_encoding = email.charset.UNKNOWN8BIT
msg_from_legacy_api = email.message.EmailMessage()
msg_from_legacy_api.set_payload('hötel piñata\n', UTF8_RAW)
print((msg_from_legacy_api._payload, msg_from_legacy_api.get_content()))
# ==> ('h\udcc3\udcb6tel pi\udcc3\udcb1ata\n', 'hötel piñata\n')

msg_from_new_api = email.message.EmailMessage()
msg_from_new_api.set_content('hötel piñata\n', charset='utf-8')
print((msg_from_new_api._payload, msg_from_new_api.get_content()))
# ==> ('h\udcc3\udcb6tel pi\udcc3\udcb1ata\n', 'hötel piñata\n')

which actually makes this a very strangely specific case, yeah....
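Given the observation above that the bytes parser produces the expected result, one way to sidestep the string-parser case when reading a file might be to open it in binary mode. This is a sketch under that assumption; io.BytesIO stands in for open("msg.eml", "rb") so the example is self-contained.

```python
import email
import io
from email.policy import default

raw = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "hötel piñata\n"
).encode("utf-8")

# Binary input: the parser sees the original bytes, not an already-decoded str.
with io.BytesIO(raw) as f:
    msg = email.message_from_binary_file(f, policy=default)

print(repr(msg.get_content()))
```

This leaves the decoding of the body to the charset declared in the Content-Type header rather than to whatever encoding the file was opened with.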