Open cvlli opened 7 years ago
I confirm the issue. I have an example where there is an encoding error:
'utf-8' codec can't decode byte 0xc3 in position 1856: invalid continuation byte
The e-mail is parsed correctly when I replace the line:
raw_email = str_encode(raw_email, 'utf-8')
with:
raw_email = str_encode(raw_email, 'latin-1')
This problem is probably a duplicate of #64.
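For anyone wanting to reproduce this kind of failure without the original message, here is a minimal sketch. The sample bytes and the try_decode helper are made up for illustration and are not part of imbox. It shows a Latin-1 byte 0xc3 triggering the same "invalid continuation byte" error under UTF-8, and why switching to Latin-1 makes the error disappear.

```python
def try_decode(data, encoding):
    """Return the decoded text, or the reason the decode failed."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


# Hypothetical sample: "SÃO PAULO" encoded as Latin-1, where Ã is the byte 0xC3.
raw = "SÃO PAULO".encode("latin-1")

utf8_result = try_decode(raw, "utf-8")      # "error: invalid continuation byte"
latin1_result = try_decode(raw, "latin-1")  # "SÃO PAULO"

# Latin-1 maps every byte to a character and never raises,
# so it can also hide a wrong guess when the data really is UTF-8:
mojibake = "é".encode("utf-8").decode("latin-1")  # "Ã©"
```

Because Latin-1 accepts any byte sequence, hard-coding it would silence the error but could garble messages that really are UTF-8.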
The chardet library could fix the problem, but perhaps there are other solutions?
@martinrusev What do you think about it? Would you be interested in a pull request?
By the way, I think the if statement at the beginning of the parse_email() function is not necessary, because raw_email is always of type bytes: parse_email() is called only by imbox.Imbox.fetch_by_uid(), and the data returned by imaplib seems to always be bytes.
@sblondon chardet would be a nice fix for this problem. A pull request is always welcome!
I checked the erroneous message with the latest imbox version from the repository and I can't reproduce the error. So I will not send a pull-request until I get a new error. I have no idea when it will occur again. Perhaps never?
@cvlli If you still have errors, could you provide an example file? If you have login access to the IMAP server, the file is probably in ~/Maildir/cur (or tmp). The goal is to add another test case to fix the issue and avoid encoding errors in the future.
This has most probably been fixed by #96 and can be closed.
And #78
Similar decoding error in 0.9.5
File ".local/lib/python3.6/site-packages/imbox/__init__.py", line 57, in fetch_list
yield (uid, self.fetch_by_uid(uid))
File ".local/lib/python3.6/site-packages/imbox/__init__.py", line 48, in fetch_by_uid
email_object = parse_email(raw_email, policy=self.parser_policy)
File ".local/lib/python3.6/site-packages/imbox/parser.py", line 181, in parse_email
parsed_email['sent_from'] = get_mail_addresses(email_message, 'from')
File ".local/lib/python3.6/site-packages/imbox/parser.py", line 55, in get_mail_addresses
addresses[index] = {'name': decode_mail_header(address_name),
File ".local/lib/python3.6/site-packages/imbox/parser.py", line 36, in decode_mail_header
logger.debug("Mail header no. {}: {} encoding {}".format(index, str_decode(text, charset or 'utf-8'), charset))
File ".local/lib/python3.6/site-packages/imbox/utils.py", line 12, in str_decode
return value.decode(encoding or 'utf-8', errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 20: invalid start byte
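The traceback above ends in a strict decode of a header part whose bytes do not match the charset being used. As a rough sketch of one way to tolerate that, using only the standard library: safe_decode_header is a hypothetical helper, not an imbox function, and the sample header is invented to reproduce the "invalid start byte" case with byte 0xae.

```python
from email.header import decode_header


def safe_decode_header(raw_header, errors="replace"):
    """Hypothetical helper: decode a header value without raising on bad bytes."""
    parts = []
    for text, charset in decode_header(raw_header):
        if isinstance(text, str):
            # decode_header() returns str when the header has no encoded words.
            parts.append(text)
            continue
        try:
            parts.append(text.decode(charset or "utf-8", errors=errors))
        except LookupError:
            # Unknown charset label: fall back to UTF-8 with the same error policy.
            parts.append(text.decode("utf-8", errors=errors))
    return "".join(parts)


# The declared charset is UTF-8, but 0xAE is not a valid UTF-8 start byte.
broken = safe_decode_header("=?utf-8?q?Registered=AE?=")
# The undecodable byte becomes U+FFFD instead of raising UnicodeDecodeError.
```

The trade-off is that errors="replace" loses the original character, which may be acceptable for a display name or a debug log line but is worth a test case once a real failing header is available.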
@wesinator can you provide the header which produces the bug?
@sblondon No, unfortunately I lost that specific one, it got deleted. But if I see it again I'll try to provide a header.
ok, thanks @wesinator
Traceback (most recent call last):
File "/root/PycharmProjects/Teste/main.py", line 26, in <module>
next_message = next(all_messages)
File "/usr/local/lib/python3.5/dist-packages/imbox/__init__.py", line 50, in fetch_list
yield (uid, self.fetch_by_uid(uid))
File "/usr/local/lib/python3.5/dist-packages/imbox/__init__.py", line 41, in fetch_by_uid
email_object = parse_email(raw_email)
File "/usr/local/lib/python3.5/dist-packages/imbox/parser.py", line 151, in parse_email
content = decode_content(part)
File "/usr/local/lib/python3.5/dist-packages/imbox/parser.py", line 119, in decode_content
return content.decode(charset)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 906: invalid start byte
Some emails can have content encoded with Latin-1 (ISO-8859-1), Latin-9 (ISO-8859-15) or something else.
Maybe add something like:
```
import chardet
...
def _decode_content(message):
    targetEncoding = "utf-8"
```
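Since the snippet above is cut off, here is a separate, minimal sketch of the general idea, not the commenter's code and not what imbox does. decode_with_fallback is a hypothetical name. It assumes chardet may or may not be installed, tries the declared charset and UTF-8 first, asks chardet.detect() for a guess only if those fail, and finally falls back to Latin-1.

```python
try:
    import chardet  # third-party package; optional in this sketch
except ImportError:
    chardet = None


def _try_decode(content, encoding):
    """Return decoded text, or None if the encoding is missing, unknown or wrong."""
    if not encoding:
        return None
    try:
        return content.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return None


def decode_with_fallback(content, declared_charset=None):
    """Hypothetical helper: decode bytes using progressively weaker guesses."""
    for encoding in (declared_charset, "utf-8"):
        text = _try_decode(content, encoding)
        if text is not None:
            return text
    if chardet is not None:
        # chardet.detect() returns a dict; "encoding" may be None.
        guess = chardet.detect(content).get("encoding")
        text = _try_decode(content, guess)
        if text is not None:
            return text
    # Last resort: Latin-1 maps every byte to a character, so this cannot fail.
    return content.decode("latin-1")
```

A call such as decode_with_fallback(payload, part.get_content_charset()) would be one way to use it on a message part, assuming payload holds the part's bytes. Detection is a guess, so short or ambiguous payloads may still be decoded with the wrong encoding.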