joeyates / imap-backup

Backup and Migrate IMAP Email Accounts
MIT License

Import support from a local storage format (mbox, maildir, Thunderbird .sbd, ...) #189

Closed bentolor closed 3 months ago

bentolor commented 5 months ago

Hi @joeyates, and thanks for your awesome tool, which provides me with valuable services for backing up my data.

I have gathered around 25 years of email in my local Thunderbird archive. Now I want to copy it all (including hundreds of folders) onto an IMAP server for online accessibility.

Any hints, or maybe a related tool, to achieve this?

I just tried import-export-tools-ng in Thunderbird and was expecting an import to IMAP option. But import-export-tools-ng only supports import into local folders. And according to my understanding, imap-backup only offers export to local Thunderbird Archive.

Any hint as to which mail army knife might help me close the gap?

joeyates commented 5 months ago

Hi @bentolor

That's an interesting one! If the thunderbird gem had a mailbox message iterator, the rest would just be a bit of glue and deciding on the import and export paths :)

...I'll have a look

bentolor commented 5 months ago

Thanks @joeyates for your quick feedback and help.

Meanwhile, I was able to find the little Python script https://github.com/rgladwell/imap-upload/ which, after some fiddling, allowed me to upload a local MBOX export. So my immediate problem has been solved, and now I realize the challenges of having a self-hosted, web/mobile, full-text searchable mail archive.
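For readers who want to see the general shape of such an upload, here is a minimal sketch using only Python's standard library. It is not the code of imap-upload; `upload_mbox` is a hypothetical helper name, and `imap` is assumed to be any object with an `imaplib`-style `append()` method. Note that `mailbox.mbox` splits messages on lines starting with `From `, so it is only as reliable as the export it reads.

```python
import mailbox


def upload_mbox(mbox_path, imap, folder):
    """Append every message of an mbox file to an IMAP folder.

    `imap` is assumed to offer append(mailbox, flags, date_time, message)
    like imaplib.IMAP4. Returns the number of messages appended.
    """
    count = 0
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for key in box.iterkeys():
            # Raw bytes of the message, without the leading "From " line.
            raw = box.get_bytes(key)
            # No flags and no date: the server then assigns its own date.
            typ, _ = imap.append(folder, None, None, raw)
            if typ != "OK":
                raise RuntimeError(f"APPEND failed for message {key!r}")
            count += 1
    finally:
        box.close()
    return count
```

With a real server, `imap` would typically be an `imaplib.IMAP4_SSL` connection after `login()`; the host, credentials and folder name are then up to the user, and the target folder has to exist already.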

I still think that, for symmetry, an imap-backup utils import-from-thunderbird FOLDER command would be a great addition.

Along the same lines, I was also missing an imap-backup remote accounts command lately ;-).

joeyates commented 5 months ago

With Thunderbird, it's not sufficient to read the mailbox file itself to get the messages.

This is for two reasons.

Firstly, the mailbox may contain messages that have been deleted.

Secondly, finding message boundaries has an edge case with plain-text emails. The following is a note explaining this second problem.

Each message starts with a 'From' line e.g.:

From - Sun Jan 14 09:39:37 2024

To find the following message, it is not sufficient to search for lines with that format as lines in the email body itself may match.

If the email message is multipart (text+html) the boundary markers can be used to skip past the body, e.g.:

From - Sun Jan 14 09:36:08 2024
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
MIME-Version: 1.0
Date: Sun, 14 Jan 2024 09:35:57 +0100
Message-ID: <CAD0bxQFFegRpHErN1rAKEp1tqwxVBhb=e2UcoBRurSSYMF+Bew@mail.gmail.com>
Subject: Blah
From: Me <me@example.com>
To: You <you@example.com>
Content-Type: multipart/alternative; boundary="00000000000043120b060ee3c968"

--00000000000043120b060ee3c968
Content-Type: text/plain; charset="UTF-8"

From - Sun Jan 14 09:34:15 2024
The previous line is part of the email body!

--00000000000043120b060ee3c968
Content-Type: text/html; charset="UTF-8"

<div dir="ltr"><div>From - Sun Jan 14 09:34:15 2024</div><div>The previous line is part of the email body!</div><div><br></div></div>

--00000000000043120b060ee3c968--
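The boundary-based skip can be sketched roughly as follows. This is only an illustration, not code from imap-backup or the thunderbird gem: `multipart_end` is a hypothetical helper, it assumes LF line endings, and it ignores any epilogue text after the closing boundary.

```python
import re

# Finds the boundary parameter inside a header block (quoted or not).
BOUNDARY_RE = re.compile(rb'boundary="?([^";\r\n]+)"?', re.IGNORECASE)


def multipart_end(data, start):
    """Return the offset just past the closing boundary line of the
    message whose 'From ' line begins at `start`.

    Returns None when the headers declare no boundary or the closing
    boundary line cannot be found.
    """
    # Headers end at the first blank line.
    header_end = data.find(b"\n\n", start)
    if header_end == -1:
        return None
    match = BOUNDARY_RE.search(data, start, header_end)
    if match is None:
        return None
    # The closing delimiter is the boundary wrapped in "--" on both sides.
    closing = b"\n--" + match.group(1) + b"--"
    pos = data.find(closing, header_end)
    if pos == -1:
        return None
    line_end = data.find(b"\n", pos + len(closing))
    return len(data) if line_end == -1 else line_end + 1
```

Any 'From' line found before the returned offset belongs to the body, so the search for the next message can resume from that offset.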

This is not possible for text-only emails, which don't have boundary markers, e.g.:

From - Sun Jan 14 09:39:37 2024
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
MIME-Version: 1.0
Date: Sun, 14 Jan 2024 09:39:34 +0100
Message-ID: <CAD0bxQEb=3gAWk6Gys3FUhOURsTOns9sREQhVJVrD-Quq=gTQg@mail.gmail.com>
Subject: Blah
From: Me <me@example.com>
To: You <you@example.com>
Content-Type: text/plain; charset="UTF-8"

From - Sun Jan 14 09:34:15 2024
The previous line is part of the email body!
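To make the failure concrete, here is a small sketch of a naive scanner that treats every line in the 'From' format as a message start. The regular expression is an assumption modelled on the example lines above, not Thunderbird's own definition of the separator.

```python
import re

# Matches separator lines such as "From - Sun Jan 14 09:39:37 2024".
FROM_LINE = re.compile(
    rb"^From - \w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}\r?$",
    re.MULTILINE,
)

# A single text-only message whose body contains a separator-like line.
SAMPLE = (
    b"From - Sun Jan 14 09:39:37 2024\n"
    b"Subject: Blah\n"
    b'Content-Type: text/plain; charset="UTF-8"\n'
    b"\n"
    b"From - Sun Jan 14 09:34:15 2024\n"
    b"The previous line is part of the email body!\n"
)


def naive_message_starts(data):
    """Return the offsets of all lines that look like a 'From' separator."""
    return [match.start() for match in FROM_LINE.finditer(data)]


# The scanner finds two "starts" although SAMPLE holds one message.
print(len(naive_message_starts(SAMPLE)))
```

Nothing in the mailbox text itself distinguishes the second match from a real separator, which is why the scanner over-splits.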

So, I believe that, to correctly identify the message boundaries in Thunderbird mailboxes, it is necessary to parse the associated *.msf index file.

These files contain indicators for the position and length of current messages in the mailbox (msgOffset, offlineMsgSize).

Unfortunately, Thunderbird still uses the dreaded Mork file format for these files.
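Assuming the index has already been decoded by some other means (no Mork parsing is attempted here), the extraction step itself could look like the sketch below. `IndexEntry` and `iter_messages` are hypothetical names, and it is an assumption that msgOffset is a byte offset to the start of a message and offlineMsgSize its length in bytes.

```python
from typing import Iterable, Iterator, NamedTuple


class IndexEntry(NamedTuple):
    """One current message, as it might be read from the index."""
    offset: int  # assumed meaning of msgOffset: byte offset in the mailbox
    size: int    # assumed meaning of offlineMsgSize: length in bytes


def iter_messages(mailbox_bytes: bytes,
                  entries: Iterable[IndexEntry]) -> Iterator[bytes]:
    """Yield the bytes of each indexed message, in mailbox order.

    Regions without an index entry (for example deleted messages)
    are skipped because nothing points at them.
    """
    for entry in sorted(entries):
        end = entry.offset + entry.size
        if entry.offset < 0 or entry.size < 0 or end > len(mailbox_bytes):
            raise ValueError(f"index entry outside mailbox: {entry}")
        yield mailbox_bytes[entry.offset:end]
```

Because the slices come from the index rather than from scanning for 'From' lines, body text that resembles a separator no longer matters.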

joeyates commented 5 months ago

I'll leave this open in the hope that an easier solution comes to light. Otherwise, I may just write a Mork parser!

bentolor commented 5 months ago

Thanks for your research and friendly feedback!

Considering how Mork is called out on Wikipedia:

He has lambasted the ostensibly "textual" format on the grounds that it is "not human-readable",[3] bemoaned the impossibility of writing a correct parser for the format,[4] and referred to it as "...the single most braindamaged file format that I have ever seen in my nineteen year career".[4]

I'm not sure I'd recommend writing a Mork parser, for the sake of sanity ;-)

I understood (and handled) the .msf files as throwaway files, especially when my full-text index got corrupted. But I also have a few corrupted emails where I'm not aware of the source of the corruption.

Reliably storing emails – how hard can it be?!?

.mbox file format family: Hold my beer!

joeyates commented 3 months ago

I've added a contrib script with an example of import from Thunderbird.