rgladwell / imap-upload

Python script for uploading a local mbox file to IMAP4 server.

Freezes on ð character in subject line #48

Open benfrancis opened 2 years ago

benfrancis commented 2 years ago

Thanks for this tool!

I just ran the script with the following command, on an .mbox file from Google Takeout containing approximately 7,000 emails:

$ python3 imap_upload.py --gmail --box imported takeout.mbox

It seems to have got stuck on an email with a subject line containing an ð ("eth") character. The full subject line is "FW: Youtube Job Wants You 👉 $20K/Month Potential! 80272150" (yes, it appears to be a spam message).

Is there anything I can do to recover from this? If I run the script a second time, will it upload duplicate emails?

benfrancis commented 2 years ago

I tried cleaning up some of the spam emails and re-exporting. This time it hung on a "â" character.
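One possible explanation for seeing "ð" and then "â" is that they are not literal characters in the subjects but the first byte of a multi-byte UTF-8 sequence displayed as Latin-1. The sketch below only illustrates that effect; it assumes the console (or some step in the pipeline) was interpreting UTF-8 bytes as Latin-1, which the thread does not confirm. The helper name `first_char_as_latin1` is made up for this example.

```python
def first_char_as_latin1(text: str) -> str:
    """Return the first character seen when UTF-8 bytes are misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")[0]


# U+1F449 (the pointing-hand emoji in the quoted subject) starts with byte 0xF0.
print(first_char_as_latin1("\U0001F449"))  # ð

# Many punctuation marks, such as U+2019 (right single quote), start with 0xE2.
print(first_char_as_latin1("\u2019"))  # â
```

If this is what happened, any subject containing an emoji or typographic punctuation could trigger the same symptom, not just messages with an actual eth character.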

rgladwell commented 2 years ago

> It seems to have got stuck on an email with a subject line containing an ð ("eth") character.

Sounds like a character encoding issue. Do you have the exact error message? Including the line of code throwing the error?

> If I run the script a second time, will it upload duplicate emails?

I believe so. You might be able to recover by searching for emails added on a specific date. You could also check whether Gmail has added any labels to the uploaded emails.

If this doesn't work, maybe we could add a feature to delete emails that exist in a specific mbox.

benfrancis commented 2 years ago

> Sounds like a character encoding issue. Do you have the exact error message? Including the line of code throwing the error?

I'm afraid it didn't actually print an error to the console; it just stopped printing any output after that character in the subject line.

It might be possible to reproduce by exporting an .mbox file with an email containing a ð or â character in its subject line. I think that was a real subject line designed to evade spam filters, not garbled output caused by your script. But I agree it seems like a character encoding issue in that something is crashing on certain UTF-8 characters.
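A small test mailbox for such a reproduction could be generated with Python's standard `mailbox` module instead of re-exporting from Gmail. This is only a sketch: the function name `write_test_mbox` and the example addresses are made up, and a locally generated file will not necessarily match the header encoding that Google Takeout produces.

```python
import mailbox
from email.message import EmailMessage


def write_test_mbox(path: str, subject: str) -> int:
    """Append one message with the given subject to an mbox; return the message count."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"      # hypothetical address
    msg["To"] = "recipient@example.com"     # hypothetical address
    msg["Subject"] = subject                # non-ASCII is RFC 2047 encoded on write
    msg.set_content("Test body.")

    box = mailbox.mbox(path)  # creates the file if it does not exist
    try:
        box.add(msg)
        box.flush()
        return len(box)
    finally:
        box.close()


if __name__ == "__main__":
    write_test_mbox("repro.mbox", "FW: Youtube Job Wants You \U0001F449 $20K/Month Potential!")
```

The resulting `repro.mbox` could then be passed to the same `imap_upload.py` command used above, ideally against a throwaway mailbox.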

> I would believe so. You might be able to recover by searching for emails added at a specific date. Also, you could check to see if Gmail has added any specific labels for uploaded emails. If this doesn't work, maybe we could add a feature to delete emails that exist in a specific mbox.

In the end I fixed the problem by re-exporting the .mbox without the offending emails, but it took a while to get rid of them all, and I had to delete several thousand emails each time I ran the script to avoid duplication. Fortunately the uploaded emails were labelled as "imported" by Gmail, which made that easy to do.

It would be useful if the script could de-duplicate emails when uploading, but I don't know how hard that is and how it would affect performance.
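One way de-duplication is sometimes done is to ask the server, before each upload, whether a message with the same `Message-ID` header already exists. The sketch below uses the standard `imaplib` SEARCH command for that. It is not part of imap-upload; `is_already_uploaded` is a hypothetical helper, and it assumes `conn` is an authenticated connection with the target mailbox already selected.

```python
import imaplib


def is_already_uploaded(conn: imaplib.IMAP4, message_id: str) -> bool:
    """Return True if the selected mailbox holds a message with this Message-ID."""
    # Quote the value as an IMAP string, escaping backslashes and double quotes.
    quoted = '"' + message_id.replace("\\", "\\\\").replace('"', '\\"') + '"'
    status, data = conn.search(None, "HEADER", "Message-ID", quoted)
    if status != "OK":
        raise RuntimeError(f"IMAP SEARCH failed: {status}")
    # data[0] holds space-separated message numbers, or is empty when nothing matched.
    return bool(data and data[0] and data[0].split())
```

On the performance question: this costs one extra round trip per message, so for tens of thousands of emails it would likely slow the upload noticeably. Messages without a `Message-ID` header would need a different check.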

I've since discovered that Google have a couple of tools for this called mail importer and import-mailbox-to-gmail. They both look harder to use than your script, but the former features de-duplication and the latter has a --from_message parameter to re-start from a certain message number in the mailbox if something goes wrong.
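The restart-from-a-message-number idea can be sketched with the standard `mailbox` module. This illustrates the general technique only, not how import-mailbox-to-gmail or imap-upload implement it; `messages_from` and its `start` parameter are names invented for this example, and message numbers are assumed to be 1-based.

```python
import mailbox
from typing import Iterator, Tuple


def messages_from(path: str, start: int = 1) -> Iterator[Tuple[int, mailbox.mboxMessage]]:
    """Yield (number, message) pairs, skipping messages numbered below `start`."""
    box = mailbox.mbox(path, create=False)
    try:
        for number, key in enumerate(box.keys(), start=1):
            if number < start:
                continue  # skipped messages are never parsed
            yield number, box[key]
    finally:
        box.close()
```

If an upload stopped after message 4,000, iterating with `messages_from("takeout.mbox", start=4001)` would pick up from the next message, provided the mbox file has not changed in between.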

rgladwell commented 2 years ago

> I'm afraid it didn't actually print an error to the console, it just stopped printing any output to the console after that character in the subject line.

The script might have terminated successfully without logging a message to the console. Can you confirm that only a subset of your emails was uploaded?

Also, if you're using Google Takeout, have you tried using the --google-takeout-* arguments?

benfrancis commented 2 years ago

> The script might successfully terminate without logging a message to the console. Can you confirm only a subset of your emails were uploaded?

Yes, it terminated after about 4,000 of 23,000 emails had been uploaded.

> Also, if you're using Google Takeout, have you tried using the --google-takeout-* arguments?

No, I didn't use that option because I didn't need to preserve labels for this particular upload. I may need it for future uploads though, so I will try it next time, thanks.

rgladwell commented 2 years ago

Could this issue be the cause of the freezing? https://github.com/rgladwell/imap-upload/pull/49

benfrancis commented 2 years ago

I think it's unlikely: the .mbox file was about 600MB and the PC the script was running on has 16GB RAM. It would also be a bit of a coincidence that it appeared to stop at unusual characters every time.

rgladwell commented 2 years ago

You could try running the script in the new dry-run mode, and capturing all the output.

adriangibanelbtactic commented 2 years ago

Our newest branch, https://github.com/btactic/imap-upload/tree/google_takeout_codepages_fixes_v1, deals better with wrong encoding in subject lines. We will merge it soon thanks to https://github.com/rgladwell/imap-upload/pull/54.

Prior to this improvement, I never saw the program terminate because of a problem with the subject encoding. The only thing that happened in my tests was that the status line for that particular email was not written and the email was skipped; the next email was then processed.
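For illustration, a tolerant way to decode a subject header for a status line might look like the following. This is a sketch using only the standard `email.header` module, not the code in the branch mentioned above; `safe_subject` is a hypothetical helper, and replacing undecodable bytes rather than skipping the message is a design assumption.

```python
from email.errors import HeaderParseError
from email.header import decode_header
from typing import Optional


def safe_subject(raw: Optional[str]) -> str:
    """Decode an RFC 2047 subject for display, never raising on bad encodings."""
    if raw is None:
        return ""
    try:
        chunks = decode_header(raw)
    except HeaderParseError:
        return raw  # malformed encoded word: fall back to the raw header text
    parts = []
    for chunk, charset in chunks:
        if isinstance(chunk, bytes):
            try:
                parts.append(chunk.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # Unknown charset name: Latin-1 can decode any byte sequence.
                parts.append(chunk.decode("latin-1"))
        else:
            parts.append(chunk)
    return "".join(parts)
```

With this approach a badly encoded subject would still produce a readable status line, with U+FFFD marking any bytes that could not be decoded.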

So... why don't you give it a go with the google_takeout_codepages_fixes_v1 branch and give us feedback?