rgladwell / imap-upload

Python script for uploading a local mbox file to IMAP4 server.

Exception thrown when directory name contains non-ASCII characters #19

Closed mrzool closed 3 years ago

mrzool commented 3 years ago

When an mbox or directory name in a scanned path contains a non-ASCII character, the script fails with the following error:

An unknown error has occurred [493]:  'ascii' codec can't decode byte 0xcc in position 62: ordinal not in range(128)

Tested with this folder structure:

Archiv
  └── testing
    ├── Drehbücher
    │   ├── Drehbücher\ 2008.mbox
    │   ├── Drehbücher\ 2009.mbox
    │   ├── Drehbücher\ 2010.mbox
    │   ├── Drehbücher\ 2011.mbox
    │   ├── Drehbücher\ 2012.mbox
    │   ├── Drehbücher\ 2013.mbox
    │   ├── Drehbücher\ 2014.mbox
    │   ├── Drehbücher\ 2015.mbox
    │   ├── Drehbücher\ 2016.mbox
    │   ├── Drehbücher\ 2017.mbox
    │   ├── Drehbücher\ 2018.mbox
    │   ├── Drehbücher\ 2019.mbox
    │   ├── Drehbücher\ 2020.mbox
    └── Drehbücher.mbox

Happy to test further and provide more feedback if needed.

rgladwell commented 3 years ago

@mrzool Do you have a test account I could use to test the script? If you have something like LastPass you can send the password securely, and delete the mailbox once this PR is closed.

mrzool commented 3 years ago

@rgladwell Will comment with some more feedback asap.

mrzool commented 3 years ago

Hey @rgladwell, I apologize for the delay, I've been very busy.

As briefly as possible: I don't think this is a problem with the mailserver. I think the error happens locally because of some weird encoding issue related to how Mail.app exports these mbox files. The issue seems to be caused exclusively by some file names, and not by the content of the files. I've spent quite some time testing and researching but I couldn't figure it out. I've hit a dead end. Here's what I found out.

The files causing the error look fine by listing them with ls:

$ ls
Drehbücher 2008.mbox         Drehbücher 2013.mbox         Drehbücher 2018.mbox
Drehbücher 2009.mbox         Drehbücher 2014.mbox         Drehbücher 2019.mbox
Drehbücher 2010.mbox         Drehbücher 2015.mbox         Drehbücher 2020.mbox
Drehbücher 2011.mbox         Drehbücher 2016.mbox         ONLINE - nachträglich.mbox
Drehbücher 2012.mbox         Drehbücher 2017.mbox

But if I list the contents of the same directory with another utility, like tree, I notice that something is off:

$ tree -L 1
.
├── Drehbu?\210cher\ 2008.mbox
├── Drehbu?\210cher\ 2009.mbox
├── Drehbu?\210cher\ 2010.mbox
├── Drehbu?\210cher\ 2011.mbox
├── Drehbu?\210cher\ 2012.mbox
├── Drehbu?\210cher\ 2013.mbox
├── Drehbu?\210cher\ 2014.mbox
├── Drehbu?\210cher\ 2015.mbox
├── Drehbu?\210cher\ 2016.mbox
├── Drehbu?\210cher\ 2017.mbox
├── Drehbu?\210cher\ 2018.mbox
├── Drehbu?\210cher\ 2019.mbox
├── Drehbu?\210cher\ 2020.mbox
└── ONLINE\ -\ nachtra?\210glich.mbox

14 directories, 0 files

tree supports non-ASCII characters with no issues. If I create a replica of this directory structure elsewhere, tree has no problem displaying the file names properly.

$ mkdir Drehbücher\ {2008..2020}.mbox && mkdir ONLINE\ -\ nachträglich.mbox
$ tree -L 1
.
├── Drehbücher\ 2008.mbox
├── Drehbücher\ 2009.mbox
├── Drehbücher\ 2010.mbox
├── Drehbücher\ 2011.mbox
├── Drehbücher\ 2012.mbox
├── Drehbücher\ 2013.mbox
├── Drehbücher\ 2014.mbox
├── Drehbücher\ 2015.mbox
├── Drehbücher\ 2016.mbox
├── Drehbücher\ 2017.mbox
├── Drehbücher\ 2018.mbox
├── Drehbücher\ 2019.mbox
├── Drehbücher\ 2020.mbox
└── ONLINE\ -\ nachträglich.mbox

14 directories, 0 files
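
One reading of the two tree listings, offered as an assumption rather than something confirmed in this thread: the exported names store "ü" in decomposed form (a plain "u" followed by the combining diaeresis U+0308), while the hand-typed replica stores the precomposed character U+00FC. That would match the byte 0xcc in the original error and the `\210` (octal for 0x88) in tree's output. A minimal sketch using Python's standard `unicodedata` module shows the two byte sequences:

```python
import unicodedata

# "Drehbücher" written with an escape so the source file's own
# normalization cannot affect the result.
name = "Drehb\u00fccher"

nfc = unicodedata.normalize("NFC", name)  # precomposed: ü is one code point
nfd = unicodedata.normalize("NFD", name)  # decomposed: u + combining diaeresis

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

print(nfc_bytes)  # b'Drehb\xc3\xbccher'
print(nfd_bytes)  # b'Drehbu\xcc\x88cher'
```

Both forms render identically in a terminal that handles combining characters, which may be why ls shows nothing unusual.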

git is also unable to display those umlauts correctly, although it uses a different escape sequence.

$ git init
Initialized empty Git repository in [path]
$ git st
On branch master
No commits yet
Untracked files:
  (use "git add <file>..." to include in what will be committed)
    "Drehb\303\274cher 2008.mbox/"
    "Drehb\303\274cher 2009.mbox/"
    "Drehb\303\274cher 2010.mbox/"
    "Drehb\303\274cher 2011.mbox/"
    "Drehb\303\274cher 2012.mbox/"
    "Drehb\303\274cher 2013.mbox/"
    "Drehb\303\274cher 2014.mbox/"
    "Drehb\303\274cher 2015.mbox/"
    "Drehb\303\274cher 2016.mbox/"
    "Drehb\303\274cher 2017.mbox/"
    "Drehb\303\274cher 2018.mbox/"
    "Drehb\303\274cher 2019.mbox/"
    "Drehb\303\274cher 2020.mbox/"
    "ONLINE - nachtr\303\244glich.mbox/"

nothing added to commit but untracked files present (use "git add" to track)

So, to sum it up, it looks like imap-upload is choking on those files because of some dumb encoding/escaping issue with the filenames that I can't quite figure out. Mail.app seems to be the culprit, as those files come straight out of that app.

Do you have any idea for a fix or workaround? I'm out of ideas.

mrzool commented 3 years ago

Hey @rgladwell, afraid I need to rectify most of what I've said above.

I just found out that it's perfectly normal for git to display UTF-8 pathnames using octal notation.
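
To illustrate: the octal escapes in the git output are just the UTF-8 bytes of the name, and git's `core.quotePath` setting controls whether they are quoted this way. The sketch below decodes such a quoted path back to text. The helper name `unquote_git_path` is hypothetical, and it only handles three-digit octal escapes, not git's other C-style escapes.

```python
import re


def unquote_git_path(quoted: str) -> str:
    """Hypothetical helper: turn \\NNN octal escapes into UTF-8 text."""
    raw = re.sub(
        rb"\\([0-7]{3})",
        lambda m: bytes([int(m.group(1), 8)]),  # one escape -> one byte
        quoted.encode("ascii"),
    )
    return raw.decode("utf-8")


decoded = unquote_git_path(r"Drehb\303\274cher 2008.mbox/")
print(decoded)
```

Here `\303\274` is 0xC3 0xBC, the UTF-8 encoding of the precomposed "ü".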

I also tested imap-upload with the directory structure I manually created above (the one that gets correctly displayed by tree) and it fails in the same way:

$ imap_upload.py -r . imaps://testing@example.com:password@mail.your-server.de:993
Connecting to mail.your-server.de:993.
An unknown error has occurred [493]:  'ascii' codec can't decode byte 0xc3 in position 24: ordinal not in range(128)

So maybe Mail.app is not the culprit after all, and imap-upload might be generally unable to handle non-ASCII pathnames?
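
For reference, this class of error can be reproduced in isolation. The sketch below is not taken from imap-upload and makes no claim about where the script fails or how it was fixed; it only shows, under the assumption that a UTF-8 encoded path is decoded with the ASCII codec somewhere, that the same message appears and that decoding as UTF-8 avoids it.

```python
# UTF-8 bytes of a path containing a precomposed "ü" (0xC3 0xBC).
path_bytes = "Drehb\u00fccher 2008.mbox".encode("utf-8")

message = None
try:
    # Decoding with the ASCII codec fails on the first byte >= 0x80.
    path_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Decoding with the encoding the bytes were actually written in works.
fixed = path_bytes.decode("utf-8")
print(fixed)
```

The reported position differs from the one in the issue because the byte offset depends on the full path being decoded.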

EDIT: Just tested it with my Gmail account, it fails in the same way after connecting to the server:

$ imap_upload.py --gmail -r . --user=[my_username] --password=[application_specific_password]
Connecting to imap.gmail.com:993.
An unknown error has occurred [493]:  'ascii' codec can't decode byte 0xc3 in position 24: ordinal not in range(128)

rgladwell commented 3 years ago

Thanks for the info, taking a look now.

rgladwell commented 3 years ago

Closed by #20