rgladwell / imap-upload

Python script for uploading a local mbox file to IMAP4 server.

Fix: support for unicode file names #20

Closed rgladwell closed 2 years ago

rgladwell commented 3 years ago

Script fails with a codec error for mailbox file names with unicode characters:

An unknown error has occurred [493]:  'ascii' codec can't decode byte 0xc3 in position 24: ordinal not in range(128)

This PR fixes the bug and closes:

https://github.com/rgladwell/imap-upload/issues/19

rgladwell commented 3 years ago

@mrzool Can you please verify whether this fixes the issue for you?

mrzool commented 3 years ago

Hey @rgladwell, sadly it doesn't, though the character reported in the error message is now different.

With a pathname containing an ü character, on master I get the following error:

An unknown error has occurred [493]:  'ascii' codec can't decode byte 0xc3 in position 42: ordinal not in range(128)

With this last patch, I get:

An unknown error has occurred [494]:  'ascii' codec can't encode character u'\xfc' in position 11: ordinal not in range(128)

EDIT: seems to be a common error.
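
For anyone following along, the two messages above are different failures: the first is a *decode* error (raw UTF-8 bytes being read as ASCII), the second an *encode* error (a unicode string being written out as ASCII). A minimal sketch of the difference, assuming Python 3; the path and the helper names `decode_failure` and `encode_failure` are made up for illustration and are not part of imap-upload:

```python
# Hypothetical path with an umlaut; "\u00fc" is the character ü.
path = "archiv/INBOX/Drehb\u00fccher"
raw = path.encode("utf-8")  # ü becomes the two bytes 0xc3 0xbc


def decode_failure(data):
    """Return (position, byte) of the first byte that is not ASCII, or None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


def encode_failure(text):
    """Return (position, character) of the first non-ASCII character, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as err:
        return err.start, text[err.start]
    return None


# Bytes -> text with the ascii codec fails on byte 0xc3 (the "can't decode" case).
print(decode_failure(raw))
# Text -> bytes with the ascii codec fails on ü (the "can't encode" case).
print(encode_failure(path))
```

So a change from "can't decode byte 0xc3" to "can't encode character \xfc" suggests the path is now being decoded correctly, and the failure has moved to a later step that tries to turn it back into ASCII.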

rgladwell commented 3 years ago

@mrzool I had fixed this issue for unicode MBOX file names, but not for unicode sub-directory names. Please retry with the latest commit and let me know how you get on.

mrzool commented 3 years ago

I'm still getting that error, but now it's happening later in the process, and once for every email.

With ec2beaf:

Connecting to mail.your-server.de:993.
Found mailbox at archiv/INBOX/Drehbücher/Drehbücher 2010.mbox/mbox...
An unknown error has occurred [494]:  'ascii' codec can't encode character u'\xfc' in position 11: ordinal not in range(128)

With the last patch b2967f5:

Connecting to mail.your-server.de:993.
Found mailbox at archiv/INBOX/Drehbücher/Drehbücher 2010.mbox/mbox...
Uploading to INBOX.Drehbücher.Drehbücher 2010...
Counting the mailbox (it could take a while for the large one).
  1/211   2.6 kB  Deutsche Drehbücher Bestellu    NG ('ascii' codec can't encode character u'\xfc' in position 24: ordinal not in range(128))
  1/211  37.9 kB  Re: Abonnement DEUTSCHE DREHB    NG ('ascii' codec can't encode character u'\xfc' in position 24: ordinal not in range(128))
  1/211   5.1 kB  Abonnement DEUTSCHE DREHBÜCHE    NG ('ascii' codec can't encode character u'\xfc' in position 24: ordinal not in range(128))
  1/211   5.3 kB  Abonnement DEUTSCHE DREHBÜCHE    NG ('ascii' codec can't encode character u'\xfc' in position 24: ordinal not in range(128))
  1/211   5.2 kB  Abonnement DEUTSCHE DREHBÜCHE    NG ('ascii' codec can't encode character u'\xfc' in position 24: ordinal not in range(128))
  1/211   5.2 kB  Abonnement DEUTSCHE DREHBÜCHE    NG ('ascii' codec can't encode character u'\xfc' in position 24: ordinal not in range(128))
  1/211   5.2 kB  Abonnement DEUTSCHE DREHBÜCHE    NG ('ascii' codec can't encode character u'\xfc' in position 24: ordinal not in range(128))
  1/211   5.2 kB  Abonnement DEUTSCHE DREHBÜCHE    NG ('ascii' codec can't encode character u'\xfc' in position 25: ordinal not in range(128))
...

rgladwell commented 3 years ago

@mrzool I suspect it's the unicode characters in the email subject lines this time. Do you have an example MBOX file I can use to test locally?

mrzool commented 3 years ago

@rgladwell I'll send a sample mbox your way now. Thanks a lot!

rgladwell commented 3 years ago

@mrzool Which version of python are you using?

rgladwell commented 3 years ago

@mrzool Also, which IMAP server are you testing against? Is it hosted or self-hosted, and what software does it run?

mrzool commented 3 years ago

@rgladwell python -V says Python 3.8.5 on my system (macOS Mojave).

The server is our production mail server on Hetzner, on a managed server plan. I'm not sure exactly which IMAP implementation they are running, but I mentioned the server capabilities in my previous issue here.

Is this a server issue again?

rgladwell commented 3 years ago

Not sure: I think the Python IMAP API doesn't handle UTF-8 strings unless the UTF8=ACCEPT capability is enabled.

However, your IMAP server doesn't appear to support the ENABLE capability, which is required to turn on UTF-8 support on both the client and the server.

Capabilities it does support are:

('IMAP4', 'IMAP4REV1', 'UIDPLUS', 'CHILDREN', 'NAMESPACE', 'THREAD=ORDEREDSUBJECT', 'THREAD=REFERENCES', 'SORT', 'QUOTA', 'IDLE', 'ACL', 'ACL2=UNION')

I'm not an expert on IMAP, so I'm not sure whether this is a standard security configuration or something specific to your install/configuration.
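
As a rough sketch of the check I mean: in Python 3.5+ `imaplib.IMAP4.enable('UTF8=ACCEPT')` only works if the server advertises ENABLE, so the script could test the capability tuple first. `can_enable_utf8` below is a hypothetical helper, not something in imap-upload, and the tuple is the one listed above:

```python
# Capabilities reported by the server in this thread.
SERVER_CAPABILITIES = (
    'IMAP4', 'IMAP4REV1', 'UIDPLUS', 'CHILDREN', 'NAMESPACE',
    'THREAD=ORDEREDSUBJECT', 'THREAD=REFERENCES', 'SORT', 'QUOTA',
    'IDLE', 'ACL', 'ACL2=UNION',
)


def can_enable_utf8(capabilities):
    """Return True only if the server advertises both ENABLE and UTF8=ACCEPT.

    Hypothetical helper: `capabilities` is a tuple of capability names,
    such as the `capabilities` attribute of an imaplib.IMAP4 connection.
    """
    caps = {cap.upper() for cap in capabilities}
    return 'ENABLE' in caps and 'UTF8=ACCEPT' in caps


print(can_enable_utf8(SERVER_CAPABILITIES))
```

With a live connection `conn`, the idea would be to call `conn.enable('UTF8=ACCEPT')` only when `can_enable_utf8(conn.capabilities)` is true, and fall back to ASCII-safe handling otherwise.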

mrzool commented 3 years ago

This might be a complicated edge case involving this particular IMAP configuration. What's weird about it, though, is that I already uploaded thousands of emails with unicode subject lines and bodies using imap-upload and never had a single issue with it.

Only pathnames containing unicode characters were causing trouble, which until now I worked around by cleaning them with detox before the upload. The issue with the subject lines only started with either ec2beaf or b2967f5.

Not sure what to make of that? 🤔

rgladwell commented 3 years ago

It was a long shot to assume the issue was environmental. I suspect these are limitations of the Python imaplib API. This could be resolved by switching to another Python IMAP library, but then this would no longer be a stand-alone script.
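
One stand-alone option, assuming the failures come from non-ASCII folder names reaching imaplib unencoded (which I have not confirmed): RFC 3501 says mailbox names with non-ASCII characters should be sent in modified UTF-7, and imaplib does not do that conversion itself. A minimal sketch of the encoding using only the standard library; `encode_mutf7` is a hypothetical helper, not existing imap-upload code:

```python
import base64


def encode_mutf7(name):
    """Encode a mailbox name as modified UTF-7 (RFC 3501, section 5.1.3)."""
    out = []
    pending = []  # run of characters that must be base64-encoded

    def flush():
        # Encode the pending run as UTF-16BE, base64 without padding,
        # with "/" replaced by "," and wrapped in "&...-".
        if pending:
            raw = ''.join(pending).encode('utf-16-be')
            b64 = base64.b64encode(raw).decode('ascii')
            out.append('&' + b64.rstrip('=').replace('/', ',') + '-')
            pending.clear()

    for ch in name:
        if 0x20 <= ord(ch) <= 0x7e:
            flush()
            # A literal ampersand is escaped as "&-"; other printable ASCII is kept.
            out.append('&-' if ch == '&' else ch)
        else:
            pending.append(ch)
    flush()
    return ''.join(out)


print(encode_mutf7('INBOX.Drehb\u00fccher'))
```

The result is pure ASCII, so it could be passed to imaplib as the mailbox name without hitting the ascii codec. Whether the server then shows the folder with the expected name would still need testing.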

rgladwell commented 3 years ago

Sorry, I'm currently busy with other work so haven't had time to take a look at this issue.

If you have the time yourself, I'd be happy to give advice and review code. Otherwise, it may be a while before I can get back round to this.

rgladwell commented 3 years ago

This looks like a likely candidate for an alternative IMAP library:

https://github.com/mjs/imapclient