python / cpython

message_from_string return msg with wrong encode #94600

Open jinlianch opened 2 years ago

jinlianch commented 2 years ago

Bug report

image

The text contains non-ASCII characters. The content of the text/plain part should display as shown above. When I use email.message_from_string to parse the MIME message, message.get_payload(decode=True) returns the "text/plain" part in the wrong encoding.

While debugging, I found that at https://github.com/python/cpython/blob/3.10/Lib/email/message.py#L278 get_payload returns payload.encode('raw-unicode-escape'), but message.get_charsets() returns utf-8, which does not match the codec that was actually used to encode the payload. As a result, the output is wrong. The final result is shown below: the bytes do not match the reported charset, so I cannot recover the correct message.
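
To make the mismatch concrete, here is a small sketch that uses plain string encoding only and no email parsing. It assumes the fallback codec really is raw-unicode-escape, as in the linked line; the sample characters are made up for illustration.

```python
# Illustrative only: compare the fallback codec with the declared charset.
# The sample characters are hypothetical, not taken from TextBased.eml.
registered = "\u00ae"   # REGISTERED SIGN, inside the Latin-1 range
cjk = "\u4e2d"          # a character outside the Latin-1 range

# raw-unicode-escape writes code points below 256 as a single byte
# and everything else as a literal \uXXXX escape sequence.
as_fallback = registered.encode("raw-unicode-escape")
as_utf8 = registered.encode("utf-8")
cjk_fallback = cjk.encode("raw-unicode-escape")

# Decoding the fallback bytes with the declared charset (utf-8) fails,
# because a lone 0xAE byte is not valid UTF-8.
try:
    as_fallback.decode("utf-8")
    round_trips = True
except UnicodeDecodeError:
    round_trips = False

print(as_fallback, as_utf8, cjk_fallback, round_trips)
```

So bytes produced by the fallback cannot, in general, be decoded with the charset that get_charsets() reports.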

image

Your environment: Python 3.9.10, macOS Catalina, version 10.15.5

The test code is below. Command to run it: python3 t.py TextBased.eml

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import email

def get_all_block(message, block_type="text/plain"):
    """Recursively collect the decoded payload bytes of every block_type part."""
    content_type = message.get_content_type()
    main_type = message.get_content_maintype()
    if main_type == "multipart":
        if message.is_multipart():
            block = None
            for part in message.get_payload():
                result = get_all_block(part, block_type)
                if result:
                    if block is None:
                        block = result
                    else:
                        block += result
            return block
        else:
            return None
    elif content_type == block_type:
        # decode=True undoes the Content-Transfer-Encoding and returns bytes
        result = message.get_payload(decode=True)
        if result is not None:
            charsets = message.get_charsets()
            print('charsets', charsets, result)
        return result
    else:
        return None

if __name__ == '__main__':
    fname = sys.argv[1]
    # Read the raw file and decode it to str, since message_from_string expects str
    with open(fname, 'rb') as fp:
        mime = fp.read().decode('utf-8', errors='ignore')
    message = email.message_from_string(mime)
    text = get_all_block(message, "text/plain")

TextBased.txt
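
One possible workaround, offered as a sketch rather than a confirmed fix: pass the raw bytes to email.message_from_bytes instead of decoding the file first, then decode the payload with the charset the part declares. The inline message below is a made-up stand-in for TextBased.eml and assumes a UTF-8 body with an 8bit transfer encoding.

```python
import email

# Hypothetical minimal message standing in for the real .eml file.
raw = (
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "Battle.net\u00ae \u4e2d\u6587\n"
).encode("utf-8")

# Parsing bytes lets the parser keep the original non-ASCII bytes
# (as surrogate escapes), so decode=True can hand them back unchanged.
msg = email.message_from_bytes(raw)
payload = msg.get_payload(decode=True)

# Use the charset declared on the part; fall back to utf-8 if none is given.
charset = msg.get_content_charset() or "utf-8"
text = payload.decode(charset, errors="replace")
print(charset, text)
```

In the test script this would mean passing fp.read() directly to email.message_from_bytes and dropping the .decode('utf-8', errors='ignore') step.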

rruuaanng commented 2 weeks ago

In other words, what you need is the content of the attachment in the email.

rruuaanng commented 2 weeks ago

After removing those two non-ASCII characters, the result is as follows:

charsets ['utf-8'] b" \tHello! If Raeshlavik#1953 isn't your BattleTag, don't click\r\nanything in this email!\t\r\n\t\r\n\t\r\n \t\r\n\t\r\n\t\r\n <http://blizzard.com/>  \t\r\n\t\r\n \t\r\n\t\r\n\t\r\nYou purchased \t\r\nWorld of Warcraft\xae Character Service: Faction Change \t\r\nOrder Number\t\r\n443473986\t\r\n\t\r\n\t\r\n \t\r\n\t\r\n\t\r\n\t\r\n\t\r\nWhat's Next?\t\r\nYou can start your download next time you log in to the Blizzard\r\nBattle.net\xae App.\t\r\n\t\r\n\t\r\n Download the Blizzard Battle.net\xae App  \t\r\n\t\r\n\t\r\nPurchase Details \t\r\n\t\r\n\t\r\nPurchase Date\t\r\nFeb 07 2018\t\r\n\t\r\nCustomer Name\t\r\nWilliam Miller\t\r\n\t\r\nPayment Method\t\r\nAmerican Express*****1009\t\r\n\t\r\n\t\r\nInvoice Number\t\r\n1692685494\t\r\n\t\r\nItem(s)\t\r\nWorld of Warcraft\xae Character Service: Faction Change\t\r\n\t\r\n\t\r\n\t\r\n\t\r\nSubtotal\t\r\nUSD 30.00\t\r\n\t\r\nTotal \t\r\nUSD 30.00\t\r\n\t\r\n\t\r\nTax (including any applicable VAT) \t\r\nUSD 0.00\t\r\n\t\r\n\t\r\n\t\r\n\t\r\nSold by\t\r\nBlizzard Entertainment\xae\t\r\n\t\r\n\t\r\n\t\r\n\t\r\n\t\r\n \t\r\n\t\r\n\t\r\nStay Connected With Blizzard \t\r\n\t\r\n\t\r\n <https://www.facebook.com/Blizzard/>\r\n<https://www.twitter.com/Blizzard_Ent/>\r\n<https://www.reddit.com/r/Blizzard/>  \t\r\n\t\r\n\t\r\n\t\r\n \t\r\n\t\r\n\t\r\n <http://blizzard.com/>  \t\r\n\t\r\n\t\r\n\t\r\nTerms of Sale\t|\tOnline Privacy Policy\t\r\n\t\r\n\t\r\n\t\r\nThanks for shopping with us! Visit us again at Blizzard Shop \r\nIf you have any questions or concerns about your order, please visit the\r\nBlizzard Support Site \r\nBlizzard Entertainment, Inc., 1 Blizzard Way, Irvine, CA 92618 \t\r\n\t\r\n\t\r\n \t\r\n\t\r\n\t\r\n\t\r\n\t\r\n"
<sys>:0: ResourceWarning: unclosed file <_io.BufferedReader name='TextBased.txt'>

I tested it on the main branch, but I'm not sure if this is what you need.