postalsys / emailengine

Headless email client
https://emailengine.app/

Horizontal Scalability / Benchmarks #111

Closed psteinroe closed 2 years ago

psteinroe commented 2 years ago

Hi,

As pointed out in the FAQ, EmailEngine is currently not horizontally scalable. Hence, it would be great to have some transparency on

Thanks!

andris9 commented 2 years ago

It's a good question. Regarding benchmarking, I don't have a good answer. Different types of accounts have different requirements. For example, it is way "cheaper" resource-wise to process a Gmail account than it is to sync a Yahoo account. So it depends on exactly what kind of accounts you are processing. To run any meaningful benchmarks I would need access to thousands of email accounts, which is kind of tricky.

Horizontal scaling is something that I've considered since day one, but it hasn't really been an actual issue so far. It's more of a "what if" kind of problem, and there are always more acute things to do, like adding support for a specific kind of OAuth2 configuration or improving the general stability of message syncing.

Two major things to implement to achieve horizontal scaling would be a) load distribution, so that IMAP connections would be divided between different instances effectively, and b) data routing, as you wouldn't be sending your API requests to the actual host that runs the IMAP command but to a proxy of some kind. That proxy would know where the requested IMAP connection currently resides and would handle communication between the original API request and the IMAP session running on a separate machine.
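To make the two pieces above concrete, here is a minimal sketch of how load distribution and data routing could fit together. It is not EmailEngine code: `assign_instance`, `Proxy`, and the `worker-*` names are hypothetical, and rendezvous hashing is just one possible way to spread accounts across instances.

```python
import hashlib


def assign_instance(account_id: str, instances: list[str]) -> str:
    """Load distribution: pick an instance by rendezvous (highest-random-weight) hashing."""
    def weight(instance: str) -> int:
        digest = hashlib.sha256(f"{instance}:{account_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(instances, key=weight)


class Proxy:
    """Data routing: remembers which instance holds each account's IMAP connection."""

    def __init__(self, instances: list[str]) -> None:
        self.instances = list(instances)
        self.routes: dict[str, str] = {}

    def register(self, account_id: str) -> str:
        # Decide where the IMAP connection for this account should live.
        self.routes[account_id] = assign_instance(account_id, self.instances)
        return self.routes[account_id]

    def route(self, account_id: str) -> str:
        # An API request for this account would be forwarded to this instance.
        return self.routes[account_id]


proxy = Proxy(["worker-a", "worker-b", "worker-c"])  # hypothetical instance names
proxy.register("account-42")
print(proxy.route("account-42"))
```

With rendezvous hashing, removing an instance only reassigns the accounts that lived on that instance, which would keep IMAP reconnects to a minimum when the pool changes.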

At the scale where you would need horizontal scaling, you would probably also have issues storing all the syncing data in a single Redis instance. You need around 40 bytes of Redis storage (=RAM) for every email that is stored on the registered email accounts. So I should at least add support for Redis Cluster, or replace Redis with a disk-based DB.

psteinroe commented 2 years ago

Thanks for the elaborate answer!

I understand that a benchmark would be tricky. As far as I can tell, one can create multiple accounts with the same server settings - maybe that would allow a small benchmark?

And thanks for the info on what will be an issue first. I guess we do not have to worry about it for the first few hundred accounts. When implementing load distribution and routing, a managed offering would probably make sense.

And I like the idea of allowing different back-ends. For us, replacing Redis with Postgres (e.g. using pg-boss as a replacement for Bull) would be great, since it's cheaper and can scale indefinitely. Did you think about implementing some kind of DB adapter interface?

andris9 commented 2 years ago

I'm looking for ways to replace the current Redis implementation because of the storage limitations – threading support is pretty much impossible right now due to the size of the data that would have to be stored for each message. Not sure about the exact details yet.

The main problem is the number of queries executed. That's also the reason I went with Redis first. Due to how IMAP gives out details about messages, it is challenging to optimize queries. EmailEngine runs a command for every email in a folder when syncing data to find differences. An account with 100 000 emails means 100 000 DB queries. That is not an issue with Redis, but it might be an issue with other DBs. I'm also thinking of using multiple data stores, for example, Redis for the main index and something else to store all the additional metadata.
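The query pattern described above can be illustrated with a small sketch. This is not EmailEngine's actual sync code: `CountingStore` and `diff_folder` are hypothetical names, and the store is a plain dictionary that only counts lookups to show that one query is issued per message.

```python
class CountingStore:
    """Stand-in for the database: maps UID -> stored flags and counts lookups."""

    def __init__(self, entries):
        self.entries = dict(entries)
        self.queries = 0

    def get(self, uid):
        self.queries += 1
        return self.entries.get(uid)


def diff_folder(server_listing, store):
    """Compare each message reported by the server with its stored entry."""
    changes = []
    for uid, flags in server_listing:
        stored = store.get(uid)  # one DB query per message in the folder
        if stored is None:
            changes.append(("new", uid))
        elif stored != flags:
            changes.append(("flags", uid))
    return changes


store = CountingStore({1: frozenset({"\\Seen"}), 2: frozenset()})
listing = [(1, frozenset({"\\Seen"})), (2, frozenset({"\\Seen"})), (3, frozenset())]
print(diff_folder(listing, store))  # [('flags', 2), ('new', 3)]
print(store.queries)                # 3
```

The number of lookups grows linearly with the folder size, so the per-query latency of the backing store matters far more than it would for a handful of batched queries.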

andris9 commented 2 years ago

I'll close this for now. FYI, it is challenging to replace Redis with a "normal" database because I haven't found a scalable solution for sequence-number-based queries. For example, when an email is deleted, the server sends a notification * 123456 EXPUNGE where 123456 is the sequence number of the deleted message. Redis with sorted sets resolves the actual message in O(log n) time. Doing the same thing in an RDBMS is way slower, as you usually have to use offsets (e.g. LIMIT 123456,1) or RANK functions, and these are too slow given how many queries there are.
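As a rough illustration of the offset-based lookup mentioned above, the sketch below uses an in-memory SQLite table. The table layout and function names are assumptions made for this example, not EmailEngine's schema. It relies on the IMAP rule that a sequence number is the 1-based position of a message when the mailbox is ordered by ascending UID.

```python
import sqlite3


def build_mailbox(uids):
    """Create an in-memory table holding one row per message UID."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (uid INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO messages (uid) VALUES (?)", [(u,) for u in uids])
    return conn


def uid_for_sequence(conn, seq):
    """Resolve a 1-based sequence number to a UID using an OFFSET query."""
    row = conn.execute(
        "SELECT uid FROM messages ORDER BY uid LIMIT 1 OFFSET ?", (seq - 1,)
    ).fetchone()
    return row[0] if row else None


def expunge(conn, seq):
    """Handle '* <seq> EXPUNGE': find the message by position, then delete it."""
    uid = uid_for_sequence(conn, seq)
    if uid is not None:
        conn.execute("DELETE FROM messages WHERE uid = ?", (uid,))
    return uid


conn = build_mailbox([101, 105, 230, 231, 400])
print(expunge(conn, 3))           # 230
print(uid_for_sequence(conn, 3))  # 231, later messages shift down by one
```

The OFFSET query has to step over every preceding row, so its cost grows with the sequence number. If the Redis sorted set is scored by UID (an assumption about the layout), the equivalent lookup is a rank query such as `ZRANGE key 2 2`, which Redis documents as O(log(N)+M).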

andris9 commented 2 years ago

Update. I did some simple benchmarking. Running EmailEngine on a $20/mo DigitalOcean VPS (4GB RAM, 2 cores).

1000 email accounts with ~1300 emails/150MB each. No major issues running EmailEngine.
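For a sense of scale, the figure of roughly 40 bytes of Redis storage per email quoted earlier in this thread can be applied to this setup. The helper below is only a back-of-the-envelope sketch with a hypothetical name, and it covers the per-message index alone, not everything else Redis holds.

```python
BYTES_PER_EMAIL = 40  # approximate per-email figure quoted earlier in this thread


def redis_index_bytes(accounts, emails_per_account, bytes_per_email=BYTES_PER_EMAIL):
    """Rough estimate of Redis memory used by the per-message index."""
    return accounts * emails_per_account * bytes_per_email


total = redis_index_bytes(accounts=1000, emails_per_account=1300)
print(f"{total / 1_000_000:.0f} MB")  # 52 MB
```

That is about 52 MB for 1.3 million messages, a small fraction of the 4GB of RAM on the VPS used here.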

Everything is synced, not much is going on with the accounts, all connections open:

[Screenshot, 2022-02-09 at 19:16:30]

Actively syncing accounts:

[Screenshot, 2022-02-09 at 19:16:22]

Minor issues:

psteinroe commented 2 years ago

That's impressive! Thanks for looking into it!

tomcon commented 2 years ago

Just a small point, but you could use a drop-down list in the middle of your paging component. It would take very little space and allow you to jump to any given page easily.

E.g., you'd have something like: first page, prev page, drop-down list, next page, and last page links.