martinrusev / imbox

Python IMAP for Human beings
MIT License

message.get_payload(decode=True) is removing some special characters #181

Open pulse-mind opened 4 years ago

pulse-mind commented 4 years ago

In parser.py, line 125, content = message.get_payload(decode=True) removes some special characters such as ç or é. It works fine with message.get_payload(decode=False), like this:


    content = message.get_payload(decode=False)
    charset = message.get_content_charset('utf-8')
    try:
        return content.decode(charset, 'ignore')
    except LookupError:
        return content.decode(charset.replace("-", ""), 'ignore')
    except AttributeError:
        return content

Do you want a pull request, or would you prefer another solution?

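
For reference, here is a minimal sketch of how the two get_payload modes differ on a correctly labelled message. The sample message is made up for illustration and is not the e-mail from this report; only the standard-library email calls are real.

```python
from email import message_from_bytes

# Made-up sample: UTF-8 text sent with quoted-printable transfer encoding.
raw = (
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"gar=C3=A7on caf=C3=A9\r\n"
)
message = message_from_bytes(raw)

# decode=False returns the payload as a str, still transfer-encoded.
undecoded = message.get_payload(decode=False)

# decode=True undoes the transfer encoding and returns bytes,
# which still have to be decoded with the declared charset.
decoded_bytes = message.get_payload(decode=True)
text = decoded_bytes.decode(message.get_content_charset("utf-8"))

print(repr(undecoded))
print(repr(decoded_bytes))
print(repr(text))
```

With decode=False the result is already a str, so calling .decode() on it raises AttributeError and the snippet above falls through to return content unchanged.
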
ghost commented 4 years ago

What encoding does the e-mail you are having problems with use? Is it utf-8?

That code comes from:

https://github.com/martinrusev/imbox/commit/ba913fe31dd6146f9500583916d4332edce1c481

https://github.com/martinrusev/imbox/pull/78

pulse-mind commented 4 years ago

Yes, I was receiving an email in UTF-8. The email was sent by another server (woocommerce).

tobip commented 4 years ago

I have the same problem.

My e-mail declares charset=utf-8, but it actually contains latin-1 encoded characters.

Example: b'\xe4\xf6\xfc\xc4\xd6\xdc\xdf' should decode to 'äöüÄÖÜß'.

Imbox reads charset=utf-8 from the raw message and uses it to decode the text, which leads to the loss of the latin-1 characters.
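
The loss can be reproduced outside imbox with plain bytes.decode. This is only a sketch; it assumes the decode step uses the 'ignore' error handler, as in the snippet quoted earlier in this thread.

```python
# The example bytes from above: latin-1 encoded umlauts and sharp s.
raw = b'\xe4\xf6\xfc\xc4\xd6\xdc\xdf'

# Decoded with the charset the bytes were really written in.
as_latin1 = raw.decode('latin-1')

# Decoded with the declared (wrong) charset; invalid bytes are silently dropped.
as_utf8_ignore = raw.decode('utf-8', 'ignore')

# Strict decoding does not lose data silently, it raises instead.
try:
    raw.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(as_latin1))       # 'äöüÄÖÜß'
print(repr(as_utf8_ignore))  # ''
print(strict_failed)         # True
```
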

As a hack, I changed line 129 in parser.py to the following code:

    latinchars = [b'\xe4', b'\xf6', b'\xfc', b'\xc4', b'\xd6', b'\xdc', b'\xdf']
    if any(s in content for s in latinchars):
        charset='latin-1'
    else:
        charset = message.get_content_charset('utf-8')

Other characters can be found here or in Python with 'ä'.encode('latin-1').

Edit: Simply setting message.get_payload(decode=False) will lead to problems if the e-mail is actually encoded with utf-8.

Another edit: On my computer, Thunderbird sends latin-1 characters while setting charset=utf-8.
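
A possible alternative to matching individual bytes, sketched under assumptions: decode strictly with the declared charset first and fall back to latin-1 only when that fails. decode_with_fallback is a hypothetical helper, not part of imbox, and the choice of latin-1 as the fallback is an assumption that fits the messages described here. One reason to prefer this over the byte list is that valid UTF-8 can contain the same bytes, for example 'Ą'.encode('utf-8') is b'\xc4\x84'.

```python
def decode_with_fallback(content, declared_charset="utf-8", fallback="latin-1"):
    """Hypothetical helper (not part of imbox): strict decode, then fallback."""
    if isinstance(content, str):
        # Nothing to decode, e.g. a payload fetched with decode=False.
        return content
    try:
        # Strict decoding raises on invalid bytes instead of dropping them.
        return content.decode(declared_charset)
    except (UnicodeDecodeError, LookupError):
        # Wrong or unknown declared charset: assume latin-1, which maps
        # every byte to a character and therefore cannot fail.
        return content.decode(fallback, "replace")


mislabelled = b'\xe4\xf6\xfc\xc4\xd6\xdc\xdf'   # latin-1 bytes labelled utf-8
real_utf8 = 'äöü Ą'.encode('utf-8')             # contains the byte b'\xc4'

print(decode_with_fallback(mislabelled))
print(decode_with_fallback(real_utf8))
```

The trade-off is that latin-1 text which happens to form valid UTF-8 byte sequences would still be decoded as UTF-8, so this is a heuristic rather than real charset detection.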