hicommonwealth / commonwealth

A platform for decentralized communities
https://commonwealth.im
GNU General Public License v3.0

[SPIKE] Moving thread content to external blob storage #8006

Closed: timolegros closed this issue 2 months ago

timolegros commented 4 months ago

Description

Given the recent LayerZero performance issues related to large threads and search indexing, we should explore solutions that allow us to host static/immutable thread content on a storage service optimized for blobs/documents. Options include but are not limited to:

- AWS S3
- MongoDB
- Firestore

When selecting an external storage solution we should consider:

- cost (storage plus egress/data transfer fees)
- read and write performance at scale
- how cleanly full-text search would integrate
- integration and ongoing maintenance effort

Note: the scope of this ticket includes researching whether search would integrate cleanly with the external storage solution, but it does not include actually migrating search. That is, moving thread content to an external service is separate from moving search to an external service; we can keep search indexes in Postgres while hosting thread content elsewhere.

Outcomes

timolegros commented 3 months ago

Stats

|           | Threads | Comments |
| --------- | ------- | -------- |
| Count     | 22,416  | 85,255   |
| Size (MB) | 285     | 170      |

The size here is the size of the `version_history` column, which contains the current version and all edited versions (except for ~2k threads).

Excluding TEXT columns, the approximate size of a record is 1,386 bytes for Threads and 1,576 bytes for Comments. PostgreSQL automatically TOASTs column values larger than 2 KB.
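
For reference, here is a minimal sketch of how these numbers could be reproduced, assuming the repo's shared Sequelize instance (the import path is hypothetical):

```typescript
// Hypothetical one-off script to reproduce the stats above: row counts and
// total on-disk size of the version_history column per table.
import { QueryTypes } from 'sequelize';
import { sequelize } from '../database'; // assumed path to the shared instance

async function contentStats(table: 'Threads' | 'Comments'): Promise<void> {
  const [row] = await sequelize.query<{ count: string; mb: string }>(
    `SELECT COUNT(*) AS count,
            ROUND(SUM(pg_column_size(version_history)) / 1024.0 / 1024.0) AS mb
       FROM "${table}";`,
    { type: QueryTypes.SELECT },
  );
  console.log(`${table}: ${row.count} rows, ~${row.mb} MB of version_history`);
}

contentStats('Threads').then(() => contentStats('Comments'));
```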

Why S3, MongoDB, Firestore?

AWS S3

MongoDB

Firestore

Notes

We should discuss this further in platform office hours.

timolegros commented 3 months ago

Adding a note:

Kurtis brought up a good point: Cloudflare R2 is basically Cloudflare's version of S3, and since it has no egress data transfer fees it could be a much cheaper option than S3.

burtonator commented 3 months ago

Just FYI, Firestore has a 1 MB document limit, and it also performs indexing over JSON fields. It wouldn't be too bad if we stored the data as a raw string, though.

That said, Firestore is probably not the best solution for this. The Google blob storage equivalent is Google Cloud Storage, which is more analogous to S3.

Honestly, I think the best solution is to use S3 with SSD storage and then sync to S3 or a vector search database.

We don't really use AI embeddings, so a vector search backend isn't much of an issue, but if we're not using fielded search then we don't gain many of the benefits of Elasticsearch.

ES can store documents too and doesn't need to be just a raw search index, but S3 is much faster for key lookups than ES and should be more cost-effective.

timolegros commented 3 months ago

After a lengthy discussion with the platform team, here is what we concluded.

We agreed that our top priority is to ensure we can scale and that the app can remain stable under load similar to what we experienced with LayerZero, once contests are deployed.

We have eliminated Firebase as a possible option, since its performance is out of our control and Kevin has experienced difficulties with it in the past.

Note: S3 and R2 are analogous. We likely want to use R2 rather than S3, since R2 has no egress fees and using S3 may go against Cloudflare's TOS. That said, we have not deeply investigated R2 yet, so it is not confirmed.

Here is how the platform team would like to proceed:

  1. Temporarily (or permanently - see discussion below) remove full-text search on thread bodies (maintaining search over thread titles). This would significantly reduce database load since, in addition to removing expensive full-text search queries, we can drop the `plaintext` and `_search` columns, thus halving the size of Thread and Comment records. This is a step we can obviously only take if product approves. The idea would be to reintroduce full-text search once we have a better setup with S3 or MongoDB.
  2. Move large thread content to S3/R2 (see the read-path sketch after this list). While we considered MongoDB and some of its strengths (integrated full-text search + queryability), it would take longer to integrate and the performance boosts are not guaranteed. As stated in my comment earlier, S3/R2 is a quick and dirty solution that is sure to increase performance for large threads. This solution is 'dirty' because it requires 2 full roundtrips to fetch threads (fetch the id from Postgres and then fetch the content from S3).
  3. Write load tests for all of the important routes.
  4. [OPTIONAL] Build a quick and dirty MongoDB prototype and run the load tests against it and S3 to compare performance. If performance is comparable, we may choose to transition to MongoDB, since it would be easier to reintroduce full-text search.
  5. [OPTIONAL] Reintroduce full-text search with Elasticsearch (AWS OpenSearch) if we continue using S3, or use the integrated full-text search in MongoDB.
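
To make step 2 concrete, here is a minimal sketch of the read path, assuming a hypothetical `thread-content` bucket and an `R2_ENDPOINT` env var (R2 is S3-compatible, so the AWS SDK works against it):

```typescript
// Hypothetical read path for step 2: Postgres keeps thread metadata plus an
// object key, and the body itself lives in R2/S3.
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({
  region: 'auto',
  endpoint: process.env.R2_ENDPOINT, // assumed env var for the R2 endpoint
});

async function fetchThreadBody(objectKey: string): Promise<string> {
  const res = await s3.send(
    new GetObjectCommand({ Bucket: 'thread-content', Key: objectKey }),
  );
  return res.Body!.transformToString();
}
```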

If product denies our request to temporarily remove full-text search, we will still move content to S3/R2 and work on load tests before circling back to MongoDB and full-text search. Note that in this case, performance during high load may be much worse than if we had removed full-text search.

Additional conclusions: if we decide to stick with S3 after step 4, we should emit ThreadCreated/CommentCreated events and then set up a worker (or merge it into CommonwealthConsumer) to post to Elasticsearch. This is preferable to using a Lambda trigger directly from S3, since content shorter than X (not defined, but could be 2k characters) would not be stored in S3 and would remain in Postgres. This way we update the search index in an event-driven fashion without affecting the latency of our API routes.
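
As a rough illustration only (the handler name and event shape are assumptions, not decisions), the event-driven indexing described above might look like:

```typescript
// Hypothetical consumer-side handler: index new thread content into
// Elasticsearch off the main write path.
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: process.env.ELASTIC_URL! });

interface ThreadCreatedEvent {
  threadId: number;
  title: string;
  body: string; // full body, wherever it is persisted
}

async function onThreadCreated(event: ThreadCreatedEvent): Promise<void> {
  // Index every thread, whether its body lives in Postgres or in S3; a Lambda
  // triggered from S3 would miss short bodies that never leave Postgres.
  await es.index({
    index: 'threads',
    id: String(event.threadId),
    document: { title: event.title, body: event.body },
  });
}
```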

burtonator commented 3 months ago

> This solution is 'dirty' because it requires 2 full roundtrips to fetch threads (fetch the id from Postgres and then fetch the content from S3).

The thing is that this might actually be FASTER than we are now, because Postgres would be smaller and the content from S3/R2 (sounds like the name of a character from Star Wars) would resolve in parallel.

This is the big thing we need, though: when fetching the content from S3/R2, we need to fetch via Promise.all so it happens in parallel.
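
A minimal sketch of that pattern, reusing the hypothetical `fetchThreadBody` helper from the earlier sketch:

```typescript
// Resolve many thread bodies from R2/S3 in parallel rather than sequentially;
// total latency is roughly one roundtrip instead of N.
async function fetchThreadBodies(objectKeys: string[]): Promise<string[]> {
  return Promise.all(objectKeys.map((key) => fetchThreadBody(key)));
}
```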

Before we decide 100% on MongoDB we should consider a few things:

Elasticsearch is "state of the art" for "bag of words" FTS, and Mongo isn't really oriented around FTS, so it won't be able to compete if we need advanced FTS.

We might also want to consider decoupling our FTS from the main write path and writing to it via a crawler. Either that, or we post to a queue and index via that strategy rather than writing to FTS directly in our main write path.

timolegros commented 3 months ago

> The thing is that this might actually be FASTER than we are now, because Postgres would be smaller and the content from S3/R2 (sounds like the name of a character from Star Wars) would resolve in parallel.
>
> This is the big thing we need, though: when fetching the content from S3/R2, we need to fetch via Promise.all so it happens in parallel.

It would be faster; that's the goal of moving to S3/R2. But there is still an upper bound on performance, since making two roundtrips may cost more than the single roundtrip another storage strategy would need.

> Elasticsearch is "state of the art" for "bag of words" FTS, and Mongo isn't really oriented around FTS, so it won't be able to compete if we need advanced FTS.

MongoDB search is built on the same backend as Elasticsearch (Apache Lucene), so I don't think the feature gap is vast.
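
For context, here is a sketch of what Lucene-backed search looks like on the MongoDB side, assuming an Atlas Search index named `thread_body` (the database, collection, and index names are all illustrative):

```typescript
// Hedged sketch: an Atlas Search query over thread bodies, for comparison
// with the Postgres GIN-based FTS we use today.
import { MongoClient } from 'mongodb';

async function searchThreads(uri: string, term: string) {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    return await client
      .db('commonwealth')
      .collection('threads')
      .aggregate([
        { $search: { index: 'thread_body', text: { query: term, path: 'body' } } },
        { $limit: 20 },
        { $project: { title: 1, score: { $meta: 'searchScore' } } },
      ])
      .toArray();
  } finally {
    await client.close();
  }
}
```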

> We might also want to consider decoupling our FTS from the main write path and writing to it via a crawler. Either that, or we post to a queue and index via that strategy rather than writing to FTS directly in our main write path.

This is what I noted in the Additional Conclusions section: we would use our existing event infrastructure (relayer + event emission) to asynchronously update Elasticsearch if that is the route we go down.

burtonator commented 3 months ago

> It would be faster; that's the goal of moving to S3/R2. But there is still an upper bound on performance, since making two roundtrips may cost more than the single roundtrip another storage strategy would need.

I think we should be fine. I've deployed this strategy before and it's a winner. The Datastreamer architecture was a 10TB Cassandra index plus a 1PB Elasticsearch index. We didn't use S3, but we would store data directly within Elasticsearch.

If you have a collection of IDs, we can Promise.all to fetch them in parallel.

Also, if we have standard queries, we can mirror those in Elasticsearch so that there's no use of Postgres.

We have to use fields in ES and then make them not tokenized. For example, we could add COMMUNITY_ID and then order the results so there would be no Postgres involved in the original search.
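
A sketch of such a mapping with the Elasticsearch JS client; in ES terms, "not tokenized" means a `keyword` field (the index and field names are illustrative):

```typescript
// Hypothetical index mapping: body is analyzed for FTS, community_id is a
// keyword (not tokenized) so ES alone can filter and sort on it.
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: process.env.ELASTIC_URL! });

async function createThreadsIndex(): Promise<void> {
  await es.indices.create({
    index: 'threads',
    mappings: {
      properties: {
        title: { type: 'text' },
        body: { type: 'text' },
        community_id: { type: 'keyword' },
        created_at: { type: 'date' },
      },
    },
  });
}
```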

timolegros commented 3 months ago

> Also, if we have standard queries, we can mirror those in Elasticsearch so that there's no use of Postgres.
>
> We have to use fields in ES and then make them not tokenized. For example, we could add COMMUNITY_ID and then order the results so there would be no Postgres involved in the original search.

Yea, this would be required if we move search to ES.

My main worry is adding infrastructure that we would have to maintain just for search, whereas with MongoDB, storage + search is one integration and MongoDB Atlas is likely much lower maintenance (AWS always adds significant complexity). Additionally, with MongoDB we don't need async updates to a separate search integration, which further reduces the load on our event pipeline.

burtonator commented 3 months ago

So you would want to use MongoDB instead of S3/R2 and do both things on one platform?

Note that we can do this with Elasticsearch too. You can store 'fields' in it and keep the full content there. You can also make an index 'indexed only' so that it doesn't retain the original. Indexed-only is about 40% of the original size, and content+index is about 140%.
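
Presumably the "indexed only" mode here means disabling `_source`; a sketch of what that looks like at index creation (the size figures above are burtonator's estimates, not measurements on our data):

```typescript
// Hypothetical 'indexed only' index: with _source disabled, ES keeps the
// inverted index but not the original document (searchable, not retrievable).
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: process.env.ELASTIC_URL! });

async function createIndexedOnlyIndex(): Promise<void> {
  await es.indices.create({
    index: 'threads-indexed-only',
    mappings: {
      _source: { enabled: false }, // drop the stored original to save space
      properties: {
        body: { type: 'text' },
      },
    },
  });
}
```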

One other point: Lucene by itself is definitely very different from Elasticsearch. There is a lot of functionality in Lucene, but ES has built really amazing features on top of it around shards, routing, shard splitting, and so forth.

I built a whole engine to do shard splitting in earlier versions of ES, but now it's part of the platform.

Maybe we won't need those features in MongoDB, but in my experience, once you get a LARGE index you end up needing them, or your costs for FTS just become way, way too high. Like 10x too high. The cloud vendors LOVE charging massive amounts of money when you get a large index.

timolegros commented 3 months ago

> So you would want to use MongoDB instead of S3/R2 and do both things on one platform?

> One other point: Lucene by itself is definitely very different from Elasticsearch. There is a lot of functionality in Lucene, but ES has built really amazing features on top of it around shards, routing, shard splitting, and so forth.

I'm not tied to MongoDB, just pointing out that it does have its advantages, namely reducing engineering load, which is a huge advantage given that we are a small team that needs to ship fast. I won't argue that ES is the best for search, but do we really need the most advanced features when all we currently do in Postgres is simple FTS over a GIN index? See the sketch below for what that amounts to.
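
For reference, an illustrative version of that query shape, assuming the `_search` tsvector column mentioned above and the shared Sequelize instance (the `english` config and import path are assumptions):

```typescript
// Roughly what our current Postgres FTS boils down to: a tsvector match
// against the GIN-indexed _search column, ranked by ts_rank.
import { QueryTypes } from 'sequelize';
import { sequelize } from '../database'; // assumed path to the shared instance

async function searchThreadsInPostgres(term: string) {
  return sequelize.query(
    `SELECT id, title
       FROM "Threads"
      WHERE _search @@ plainto_tsquery('english', :term)
      ORDER BY ts_rank(_search, plainto_tsquery('english', :term)) DESC
      LIMIT 20;`,
    { type: QueryTypes.SELECT, replacements: { term } },
  );
}
```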

> Maybe we won't need those features in MongoDB, but in my experience, once you get a LARGE index you end up needing them, or your costs for FTS just become way, way too high. Like 10x too high. The cloud vendors LOVE charging massive amounts of money when you get a large index.

It's a good point to bring up with product: does the value of FTS to users justify the cost of having scalable and efficient FTS?

burtonator commented 3 months ago

@timolegros thanks... yeah, I think the main thing we should probably be thinking about is whether we need FTS (full-text search) in the first place.

I'd like to argue that the fastest way to implement something is to not implement it in the first place.

If it's just a feature-matrix checklist item then maybe we need to add it, but I don't think that's the best way to think about product feature requests.

This is a good overview of how I personally think about it:

https://blog.superhuman.com/how-superhuman-built-an-engine-to-find-product-market-fit/

I think if we were to rank all of our features, then FTS would be somewhere in the long tail. Like around feature 30-40 but if you factor in the complexity of this it looks really difficult to implement.

(image: ranked feature chart; green head on the left, yellow long tail on the right)

So we're on the same page... it would be in the yellow portion on the right.

The green is where we're going to find product-market fit, and those features are what our customers are going to love us for.

FTS is deceptively complicated to implement and is going to require a lot of time.

If we just move the content out of Postgres and into S3/R2, that's going to be pretty straightforward. PG will just store the pointers into S3, and it will dramatically reduce the in-memory footprint of PG.

Fetching the content from S3/R2 can be done in parallel.

This cutover is very simple and we can get about 90% of the advantages that we want here.

If we screw up and it turns out FTS matters, then we can add it after the fact.

If we do want to implement FTS, we could implement it directly from our sitemap. This would increase the value of our sitemap implementation, but it would also mean we could just use something like Algolia in the short term. Algolia is kind of expensive, but it supports indexing via sitemaps.

Also, one issue with an FTS implementation is re-indexing. At some point we're going to want to re-index our FTS index; there might be an algorithm change or new fields we want to index. If we go with an event-sourcing pattern to index the content we wouldn't have this capability, but indexing via sitemaps allows us to decouple our main write pipeline and make our FTS eventually consistent.

Another point: since FTS is currently implemented (just on PG), we could remove it from the frontend and see if users complain.

I suspect that if we disable it we're not going to see many complaints.

This would yield a lot of cost and time savings here.

Totally happy to be wrong here. FTS is a great feature to have - I just think it's going to be harder than we think.

timolegros commented 2 months ago

Alright, so an update here: search cannot and will not be disabled, even temporarily. Transitioning to S3 will require some changes to the way we index for search in Postgres, as described in this ticket: https://github.com/hicommonwealth/commonwealth/issues/8657.

The plan is to maintain search functionality in Postgres in the short term and evaluate transitioning to Elasticsearch or another dedicated search solution after the S3 migration is complete.