joeyates / imap-backup

Backup and Migrate IMAP Email Accounts
MIT License

Disk writes grow larger and slower the further into the backup we get #157

Closed. njakobsen closed this issue 1 year ago.

njakobsen commented 1 year ago

I'm at 150,000 messages imported and I'm seeing a sustained disk write speed of 17MB/s, almost 1TB since I started the import this morning. In total, the .imap file is only about 9MB, so it's being rewritten about twice per second. The writes to the .mbox are negligible in comparison, at about 20KB/s.

It looks like the Serializer::Imap class must process and save the JSON metadata every time a message is appended. This starts out fast, but grows slower as more and more messages are imported, since the whole file is rewritten each time.

https://github.com/joeyates/imap-backup/blob/ca7d36d45e932d097c7ef4ad0187326b1145fb39/lib/imap/backup/serializer/imap.rb#L109-L121

Would it be possible to keep the .imap JSON in memory and flush it periodically instead of on every message? This would increase import speed and reduce drive wear. If the process is interrupted with a SIGKILL, then we'd need to reimport the unflushed messages since they won't be in the .imap file, but I think that would be preferable to the amount of overhead currently being performed.
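The periodic-flush idea could look something like the following. This is a hypothetical sketch, not imap-backup's actual Serializer::Imap API: the class name, method signatures, and `FLUSH_EVERY` threshold are all made up for illustration. The point is that appends touch only memory, and the JSON file is rewritten once per batch rather than once per message.

```ruby
require "json"

# Hypothetical sketch of buffered metadata writes (not the gem's real class).
class BufferedImapMetadata
  FLUSH_EVERY = 500 # illustrative batch size

  def initialize(path)
    @path = path
    # Load existing metadata if the file exists and is non-empty.
    @messages = File.size?(path) ? JSON.parse(File.read(path)) : []
    @pending = 0
  end

  # Record a message's metadata in memory; only hit the disk every
  # FLUSH_EVERY appends instead of on each one.
  def append(uid, offset, length)
    @messages << { "uid" => uid, "offset" => offset, "length" => length }
    @pending += 1
    flush if @pending >= FLUSH_EVERY
  end

  # Rewrite the whole metadata file in one go.
  def flush
    File.write(@path, JSON.generate(@messages))
    @pending = 0
  end
end
```

With a batch size of 500, a 150,000-message import would rewrite the file about 300 times instead of 150,000.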

njakobsen commented 1 year ago

In the meantime, anyone experiencing performance issues with large imports can use a RAM disk. See https://gist.github.com/htr3n/344f06ba2bb20b1056d7d5570fe7f596 for macOS instructions. Just remember to move the data to permanent storage when you're done. Performance is not dramatically increased by storing the mailboxes this way during backup, but you save your poor SSD. The lack of improvement when switching to RAM suggests the majority of the slowdown comes from processing rather than from reads and writes.
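For reference, the linked gist boils down to the commands below. The 4 GiB size and the volume name `ImapBackupRAM` are arbitrary examples; `ram://` sizes are given in 512-byte sectors. The mount commands are macOS-specific and are shown commented out since they actually allocate a RAM disk.

```shell
# Size of the RAM disk in 512-byte sectors (4 GiB here, purely illustrative).
SECTORS=$((4 * 1024 * 1024 * 1024 / 512))
echo "$SECTORS"   # → 8388608

# macOS-only; uncomment to actually create and mount the RAM disk:
# diskutil erasevolume HFS+ 'ImapBackupRAM' "$(hdiutil attach -nomount ram://$SECTORS)"
# ...run the backup with its destination inside /Volumes/ImapBackupRAM...
# diskutil eject ImapBackupRAM   # after copying the results to permanent storage
```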

joeyates commented 1 year ago

Hi @njakobsen, thanks for opening the issue.

You are quite correct in saying that the .imap file is rewritten for each downloaded message during backup.

The aim is to minimize the risk that the mailbox file (.mbox) and the associated metadata (the .imap) go out of sync.

While I would like to maintain this guarantee, I understand the need for something faster for very large mailboxes.

One idea that comes to mind is an "unsafe writes" option. This would skip saving the .imap file until the end of the download for a particular mailbox. Instead of trying to guarantee the integrity of the two files, this could be used in conjunction with a verification pass using imap-backup local check.

njakobsen commented 1 year ago

Locally I played around with the at_exit callback to save the progress on quit, though I don't know how safe it is to write after interrupting the program. A safer solution would be to write incrementally, maybe every 100 or 500 messages, and then once more on completion.
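Combining those two ideas might look like the sketch below. It is hypothetical (the class name and writer callback are invented for illustration): a buffer flushes every N items, an `at_exit` hook flushes the remainder on normal exit, and a SIGINT trap turns ctrl+c into a normal exit so the hook runs. Note that SIGKILL, as mentioned above, can never be trapped, so anything unflushed at that point is lost regardless.

```ruby
# Hypothetical sketch: incremental flushing plus a best-effort flush on exit.
class FlushOnExit
  def initialize(every: 100, &writer)
    @every = every
    @writer = writer # called with each batch, e.g. to rewrite the .imap file
    @buffer = []
    at_exit { flush }                # flush any remainder on normal exit
    Signal.trap("INT") { exit(130) } # ctrl+c becomes a normal exit, so at_exit runs
  end

  def append(item)
    @buffer << item
    flush if @buffer.size >= @every
  end

  def flush
    return if @buffer.empty?
    @writer.call(@buffer.dup)
    @buffer.clear
  end
end
```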

While I have your attention, another large backup issue I ran into was that the v2->3 migration check is run on every message as part of the validate! method. I solved the problem locally by remembering that we've already checked and not doing it again. With that change the performance issues disappear and importing a large backup actually goes about as fast as a small one (except for the extra download time). If you think that's a reasonable thing to do, I can make a pull request.
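The memoization described above can be sketched as follows. This is not the gem's actual code: the class and counter are hypothetical, with the counter standing in for the expensive Version2Migrator check so the effect is observable. Once the check has run, every later call to validate! returns immediately.

```ruby
# Hypothetical sketch: remember that the v2 -> v3 migration check has
# already run, instead of repeating it for every appended message.
class FolderSerializer
  attr_reader :migration_checks

  def initialize
    @checked = false
    @migration_checks = 0
  end

  def validate!
    return if @checked       # skip once the check has been done
    @migration_checks += 1   # stands in for the expensive migration check
    @checked = true
  end

  def append(message)
    validate!                # called for every message, as described above
    # ...append the message to the .mbox and record its metadata...
  end
end
```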

joeyates commented 1 year ago

Good catch about the repeated check via the Version2Migrator. A PR would be much appreciated.

njakobsen commented 1 year ago

Done. See https://github.com/joeyates/imap-backup/pull/158.

njakobsen commented 1 year ago

@joeyates I would still keep this one open, since the ticket was originally about the excessive disk writes; #158 only resolved the validation performance issue, not the writes. I'll keep thinking about this one. I think ideally we would just flush occasionally, but I haven't had a chance to find a safe way to flush when we ctrl+c or when we reach the end of the import. I'm open to ideas for when I have a chance to return to this.

joeyates commented 1 year ago

@njakobsen I'm looking at using a new system to store the IMAP metadata - replacing a JSON file with something that provides some integrity guarantees and writes quicker.

Currently, I'm playing with the idea of using SQLite. For folders with very few emails, this would cause slightly more disk use (8 kB instead of under 100 bytes), but I think the reliability and performance gains would outweigh this small disadvantage.

To make all this possible, I'm doing a long-overdue refactor of some of the core in the feature/refactor-core-classes branch.

joeyates commented 1 year ago

I've decided against adding SQLite as a dependency as it would make setup of the gem a lot more difficult.

Instead, I've implemented an optional system that holds all of a folder's new downloads in memory and makes just one rewrite of the .imap metadata file.

I'm completing profiling of this for large workloads (>100k emails) and it makes a massive speed improvement.

leojonathanoh commented 1 year ago

It seems the performance of backing up to an .mbox would be an issue for larger mailboxes. For such cases, might it be better to back up emails as Maildir using isync? From my testing it is very performant and backs up incrementally very quickly, since the files in a Maildir are stored in a flat structure.

joeyates commented 1 year ago

@njakobsen I've released 11.0.0.rc1 with optional delayed writes during folder backup.

All additions to the mailbox's .imap metadata file are stored in memory and written just once.

I'll publish the benchmarks I've done soon, but, for >100k messages, I'm seeing a 20-30 times speedup.
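The delayed-write behaviour described above can be illustrated with a transaction-style pattern. This is a hypothetical sketch, not the code shipped in 11.0.0.rc1 (the class name and `transaction` method are invented for illustration): all of a folder's appends accumulate in memory, and the metadata file is written exactly once when the block finishes.

```ruby
require "json"

# Hypothetical sketch of delayed metadata writes: one rewrite per folder.
class DelayedImapWriter
  attr_reader :writes

  def initialize(path)
    @path = path
    @writes = 0
  end

  # All appends inside the block touch only memory; the file is
  # rewritten once, when the block returns.
  def transaction
    @buffer = []
    yield self
    File.write(@path, JSON.generate(@buffer))
    @writes += 1
  end

  def append(entry)
    @buffer << entry
  end
end
```

The trade-off is the one discussed earlier in the thread: a crash mid-folder loses the buffered metadata, which a verification pass such as imap-backup local check can detect afterwards.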

njakobsen commented 1 year ago

That's very exciting! I'll be able to revert my local hacks. Thanks for your help addressing this, and for the great tool.

joeyates commented 1 year ago

I released this as version 11.0.0.