Open jinlianch opened 2 years ago
Bug report
The text contains non-ASCII characters. The content of the text/plain part should appear as shown above. When email.message_from_string is used to parse the MIME message, message.get_payload(decode=True) returns the "text/plain" part in the wrong encoding.
While debugging, I found that at https://github.com/python/cpython/blob/3.10/Lib/email/message.py#L278 get_payload returns payload.encode('raw-unicode-escape'), but message.get_charsets() returns utf-8, which does not match the encoding that was actually used. So the result is wrong. The final result is below; because the charset does not match, I can't recover the correct message.
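The mismatch described above can be reproduced without the attached file. The following is a minimal sketch, not the reporter's script: the tiny message in `RAW` is a made-up stand-in for the real email, and it assumes the body is sent as 8bit UTF-8. It parses the same data once as `str` and once as `bytes` and compares what `get_payload(decode=True)` returns in each case.

```python
import email

# Hypothetical stand-in for the attached message: one text/plain part,
# declared as UTF-8, containing a single non-ASCII character (U+00AE).
RAW = (
    "MIME-Version: 1.0\r\n"
    "Content-Type: text/plain; charset=utf-8\r\n"
    "Content-Transfer-Encoding: 8bit\r\n"
    "\r\n"
    "Warcraft\u00ae\r\n"
).encode("utf-8")

# Path taken in the report: decode to str first, then parse.
from_str = email.message_from_string(RAW.decode("utf-8"))
# Alternative: hand the original bytes to the parser.
from_bytes = email.message_from_bytes(RAW)

str_payload = from_str.get_payload(decode=True)
bytes_payload = from_bytes.get_payload(decode=True)

# Both messages declare the same charset.
print(from_str.get_charsets(), from_bytes.get_charsets())

# The str-parsed payload carries U+00AE as the single byte b'\xae'
# (the raw-unicode-escape fallback), not as UTF-8.
print(str_payload)

# The bytes-parsed payload keeps the original UTF-8 bytes b'\xc2\xae',
# so decoding with the declared charset works.
print(bytes_payload.decode(from_bytes.get_content_charset()))
```

If this sketch matches the real case, parsing with `email.message_from_bytes` (or `email.message_from_binary_file`) instead of decoding the file to `str` first may avoid the problem, because the payload bytes then stay in the charset the header declares.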
Your environment Python 3.9.10, macOS Catalina, version 10.15.5
The test code is below. Run it with: python3 t.py TextBased.eml
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
import sys
import email

def get_all_block(message, block_type = "text/plain"):
    content_type = message.get_content_type()
    main_type = message.get_content_maintype()
    if main_type == "multipart":
        if message.is_multipart():
            block = None
            for part in message.get_payload():
                result = get_all_block(part, block_type)
                if result:
                    if block is None:
                        block = result
                    else:
                        block += result
            return block
        else:
            return None
    elif content_type == block_type:
        result = message.get_payload(decode=True)
        if result is not None:
            charsets = message.get_charsets()
            print('charsets', charsets, result)
            return result
    else:
        return None

if __name__ == '__main__':
    fname = sys.argv[1]
    fp = open(fname, 'rb')
    mime = fp.read().decode('utf-8', errors='ignore')
    message = email.message_from_string(mime)
    text = get_all_block(message, "text/plain")
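For comparison, here is a hedged sketch of a variant of the test code that parses bytes rather than a decoded string and then decodes each matching part with its own declared charset. The function name `extract_text`, the UTF-8 fallback, and the sample message in `SAMPLE` are assumptions made for illustration; they are not part of the original script or of the attached email.

```python
import email


def extract_text(raw: bytes, block_type: str = "text/plain") -> str:
    """Concatenate the decoded text of every part whose type is block_type."""
    # Parsing bytes keeps 8bit payloads in their original encoding.
    message = email.message_from_bytes(raw)
    chunks = []
    for part in message.walk():
        if part.get_content_type() != block_type:
            continue
        payload = part.get_payload(decode=True)  # bytes, or None for containers
        if payload is None:
            continue
        # Assumption: fall back to UTF-8 when no charset is declared
        # or when the declared charset is unknown to Python.
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "".join(chunks)


# Hypothetical two-part message used only to exercise the function.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="b1"\r\n'
    b"\r\n"
    b"--b1\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    + "Battle.net\u00ae App\r\n".encode("utf-8")
    + b"--b1\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b"<p>hi</p>\r\n"
    b"--b1--\r\n"
)

text = extract_text(SAMPLE)
print(text)
```

To run this against a file, the bytes could be read inside a `with open(fname, 'rb') as fp:` block and passed to `extract_text(fp.read())`, which also closes the file when the block exits.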
That is to say, what you need is the content of the attachment in the email.
After removing those two non-ASCII characters, the result is as follows:
charsets ['utf-8'] b" \tHello! If Raeshlavik#1953 isn't your BattleTag, don't click\r\nanything in this email!\t\r\n\t\r\n\t\r\n \t\r\n\t\r\n\t\r\n <http://blizzard.com/> \t\r\n\t\r\n \t\r\n\t\r\n\t\r\nYou purchased \t\r\nWorld of Warcraft\xae Character Service: Faction Change \t\r\nOrder Number\t\r\n443473986\t\r\n\t\r\n\t\r\n \t\r\n\t\r\n\t\r\n\t\r\n\t\r\nWhat's Next?\t\r\nYou can start your download next time you log in to the Blizzard\r\nBattle.net\xae App.\t\r\n\t\r\n\t\r\n Download the Blizzard Battle.net\xae App \t\r\n\t\r\n\t\r\nPurchase Details \t\r\n\t\r\n\t\r\nPurchase Date\t\r\nFeb 07 2018\t\r\n\t\r\nCustomer Name\t\r\nWilliam Miller\t\r\n\t\r\nPayment Method\t\r\nAmerican Express*****1009\t\r\n\t\r\n\t\r\nInvoice Number\t\r\n1692685494\t\r\n\t\r\nItem(s)\t\r\nWorld of Warcraft\xae Character Service: Faction Change\t\r\n\t\r\n\t\r\n\t\r\n\t\r\nSubtotal\t\r\nUSD 30.00\t\r\n\t\r\nTotal \t\r\nUSD 30.00\t\r\n\t\r\n\t\r\nTax (including any applicable VAT) \t\r\nUSD 0.00\t\r\n\t\r\n\t\r\n\t\r\n\t\r\nSold by\t\r\nBlizzard Entertainment\xae\t\r\n\t\r\n\t\r\n\t\r\n\t\r\n\t\r\n \t\r\n\t\r\n\t\r\nStay Connected With Blizzard \t\r\n\t\r\n\t\r\n <https://www.facebook.com/Blizzard/>\r\n<https://www.twitter.com/Blizzard_Ent/>\r\n<https://www.reddit.com/r/Blizzard/> \t\r\n\t\r\n\t\r\n\t\r\n \t\r\n\t\r\n\t\r\n <http://blizzard.com/> \t\r\n\t\r\n\t\r\n\t\r\nTerms of Sale\t|\tOnline Privacy Policy\t\r\n\t\r\n\t\r\n\t\r\nThanks for shopping with us! Visit us again at Blizzard Shop \r\nIf you have any questions or concerns about your order, please visit the\r\nBlizzard Support Site \r\nBlizzard Entertainment, Inc., 1 Blizzard Way, Irvine, CA 92618 \t\r\n\t\r\n\t\r\n \t\r\n\t\r\n\t\r\n\t\r\n\t\r\n"
<sys>:0: ResourceWarning: unclosed file <_io.BufferedReader name='TextBased.txt'>
I tested it on the main branch, but I'm not sure if this is what you need.
TextBased.txt