Problems decoding content

cvlli commented 7 years ago

Traceback (most recent call last):
  File "/root/PycharmProjects/Teste/main.py", line 26, in <module>
    next_message = next(all_messages)
  File "/usr/local/lib/python3.5/dist-packages/imbox/__init__.py", line 50, in fetch_list
    yield (uid, self.fetch_by_uid(uid))
  File "/usr/local/lib/python3.5/dist-packages/imbox/__init__.py", line 41, in fetch_by_uid
    email_object = parse_email(raw_email)
  File "/usr/local/lib/python3.5/dist-packages/imbox/parser.py", line 151, in parse_email
    content = decode_content(part)
  File "/usr/local/lib/python3.5/dist-packages/imbox/parser.py", line 119, in decode_content
    return content.decode(charset)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 906: invalid start byte

Some emails have content encoded in Latin-1, Latin-9 (ISO-8859-15), or another charset.

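A possible fix:

```python
import chardet
# ...

def _decode_content(message):
    # Payload bytes with the Content-Transfer-Encoding already undone.
    content = message.get_payload(decode=True)
    # Charset declared in the part's headers, defaulting to UTF-8.
    charset = message.get_content_charset('utf-8')
    try:
        return content.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # The declared charset is wrong or unknown: let chardet guess.
        source_encoding = chardet.detect(content).get("encoding") or "utf-8"
        return content.decode(source_encoding, errors="replace")
    except AttributeError:
        # get_payload() returned None, so there is nothing to decode.
        return content
```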

sblondon commented 7 years ago

I confirm the issue. I have an example where there is an encoding error: 'utf-8' codec can't decode byte 0xc3 in position 1856: invalid continuation byte

The e-mail is parsed correctly when I replace the line raw_email = str_encode(raw_email, 'utf-8') with raw_email = str_encode(raw_email, 'latin-1')
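To illustrate why switching the codec changes the outcome, here is a minimal sketch using only the standard library. The sample bytes are invented for the example; the byte 0xa9 is the one reported in the first traceback of this issue. Latin-1 maps every byte value to a character, so decoding with it never raises, but the result is only correct if the message really was Latin-1.

```python
# Invented sample: "©" encoded as Latin-1 is the single byte 0xa9.
raw = "Copyright © 2017".encode("latin-1")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # In UTF-8, 0xa9 is a continuation byte and cannot start a character.
    utf8_ok = False

# Latin-1 assigns a character to all 256 byte values, so this cannot raise.
text = raw.decode("latin-1")
```

This is why hard-coding 'latin-1' hides the error rather than detecting the real charset: a UTF-8 message decoded this way would come out as mojibake instead of raising.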

This problem is probably a duplicate of #64.

The chardet library could fix the problem, but perhaps there are other solutions?

@martinrusev What do you think about it? Would you be interested in a pull request?

By the way, I think the if at the beginning of the parse_email() function is unnecessary because raw_email is always bytes: parse_email() is called only by imbox.Imbox.fetch_by_uid(), and the data returned by imaplib seems to always be bytes.

martinrusev commented 7 years ago

@sblondon chardet would be a nice fix for this problem. A pull request is always welcome!

sblondon commented 6 years ago

I checked the erroneous message with the latest imbox version from the repository and I can't reproduce the error. So I will not send a pull-request until I get a new error. I have no idea when it will occur again. Perhaps never?

@cvlli If you still have errors, could you provide an example file? If you have login access to the IMAP server, the file is probably in ~/Maildir/cur (or tmp). The goal is to add another test case to fix the issue and avoid encoding errors in the future.
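A test case built from such a file might look like the sketch below. It uses only the standard library email package rather than imbox itself, the message bytes are an invented stand-in for a real file from the Maildir, and decode_part is a hypothetical helper, not a function from imbox.

```python
import email

# Invented stand-in for a saved message: the header declares UTF-8,
# but the body is actually Latin-1 (byte 0xe9 is "é").
RAW = (
    b"From: sender@example.com\r\n"
    b"Subject: test\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xe9\r\n"
)

def decode_part(part, fallback="latin-1"):
    """Hypothetical helper: try the declared charset, then a fallback."""
    payload = part.get_payload(decode=True)
    charset = part.get_content_charset("utf-8")
    try:
        return payload.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # The declared charset does not match the bytes (or is unknown).
        return payload.decode(fallback)

message = email.message_from_bytes(RAW)
body = decode_part(message)
```

For a real sample, the bytes would come from reading the Maildir file in binary mode instead of the RAW constant.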

ghost commented 6 years ago

This has most probably been fixed by #96 and can be closed.

ghost commented 6 years ago

And #78

wesinator commented 5 years ago

Similar decoding error in 0.9.5

  File ".local/lib/python3.6/site-packages/imbox/__init__.py", line 57, in fetch_list
    yield (uid, self.fetch_by_uid(uid))
  File ".local/lib/python3.6/site-packages/imbox/__init__.py", line 48, in fetch_by_uid
    email_object = parse_email(raw_email, policy=self.parser_policy)
  File ".local/lib/python3.6/site-packages/imbox/parser.py", line 181, in parse_email
    parsed_email['sent_from'] = get_mail_addresses(email_message, 'from')
  File ".local/lib/python3.6/site-packages/imbox/parser.py", line 55, in get_mail_addresses
    addresses[index] = {'name': decode_mail_header(address_name),
  File ".local/lib/python3.6/site-packages/imbox/parser.py", line 36, in decode_mail_header
    logger.debug("Mail header no. {}: {} encoding {}".format(index, str_decode(text, charset or 'utf-8'), charset))
  File ".local/lib/python3.6/site-packages/imbox/utils.py", line 12, in str_decode
    return value.decode(encoding or 'utf-8', errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 20: invalid start byte
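
This kind of failure can be reproduced with a header whose encoded word declares one charset but carries bytes from another. The sketch below uses only the standard library; the header value is invented (the original was not kept), and decode_header_tolerant is a hypothetical helper, not part of imbox.

```python
from email.header import decode_header

def decode_header_tolerant(value):
    """Hypothetical helper: decode a header without raising on bad bytes."""
    parts = []
    for text, charset in decode_header(value):
        if isinstance(text, bytes):
            try:
                text = text.decode(charset or "utf-8")
            except (UnicodeDecodeError, LookupError):
                # Declared charset is wrong or unknown: substitute U+FFFD
                # for undecodable bytes instead of failing.
                text = text.decode("utf-8", errors="replace")
        parts.append(text)
    return "".join(parts)

# Invented header: declared as UTF-8 but containing the Latin-1 byte 0xAE.
name = decode_header_tolerant("=?utf-8?q?Acme=AE_Support?=")
```

Replacing undecodable bytes loses the original character, so this is a trade-off: the message still parses, at the cost of a placeholder in the sender name.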

sblondon commented 5 years ago

@wesinator can you provide the header which produces the bug?

wesinator commented 5 years ago

@sblondon No, unfortunately I lost that specific one, it got deleted. But if I see it again I'll try to provide a header.

sblondon commented 5 years ago

ok, thanks @wesinator

martinrusev / imbox

Problems decoding content #77