hicommonwealth / commonwealth

A platform for decentralized communities
https://commonwealth.im
GNU General Public License v3.0

[SPIKE] Moving thread content to external blob storage #8006

Closed: timolegros closed this issue 2 months ago

timolegros commented 4 months ago

Description

Given the recent LayerZero performance issues related to large threads and search indexing, we should explore solutions that allow us to host static/immutable thread content on a storage service optimized for blobs/documents. Options include but are not limited to:

- AWS S3
- MongoDB
- Firestore

When selecting an external storage solution we should consider:

- cost (storage plus egress/data transfer fees)
- read and write performance at scale
- how cleanly full-text search would integrate
- integration and ongoing maintenance effort

Note: the scope of this ticket includes researching whether search would integrate cleanly with the external storage solution, but it does not include actually migrating search. That is, moving thread content to an external service is separate from moving search to an external service; we can keep search indexes in Postgres while hosting thread content elsewhere.

Outcomes

timolegros commented 3 months ago

Stats

|           | Threads | Comments |
| --------- | ------- | -------- |
| Count     | 22,416  | 85,255   |
| Size (MB) | 285     | 170      |

The size here is the size of the `version_history` column, which contains the current version and all edited versions (except for ~2k threads).

Excluding TEXT columns, the approximate size of a record is 1,386 bytes for Threads and 1,576 bytes for Comments. PostgreSQL automatically TOASTs column values larger than 2 KB.
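
For reference, here is a minimal sketch of how these numbers could be reproduced, assuming the repo's shared Sequelize instance (the import path is hypothetical):

```typescript
// Hypothetical one-off script to reproduce the stats above: row counts and
// total on-disk size of the version_history column per table.
import { QueryTypes } from 'sequelize';
import { sequelize } from '../database'; // assumed path to the shared instance

async function contentStats(table: 'Threads' | 'Comments'): Promise<void> {
  const [row] = await sequelize.query<{ count: string; mb: string }>(
    `SELECT COUNT(*) AS count,
            ROUND(SUM(pg_column_size(version_history)) / 1024.0 / 1024.0) AS mb
       FROM "${table}";`,
    { type: QueryTypes.SELECT },
  );
  console.log(`${table}: ${row.count} rows, ~${row.mb} MB of version_history`);
}

contentStats('Threads').then(() => contentStats('Comments'));
```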

Why S3, MongoDB, Firestore?

AWS S3

MongoDB

Firestore

Notes

We should discuss this further in platform office hours.

timolegros commented 3 months ago

Adding a note:

Kurtis brought up a good point: Cloudflare R2 is basically Cloudflare's version of S3, and since it has no egress data transfer fees it could be a much cheaper option than S3.

burtonator commented 3 months ago

Just FYI, Firestore has a 1 MB document limit, and it also performs indexing over JSON fields. It wouldn't be too bad if we stored the data as a raw string, though.

That said, Firestore is probably not the best solution for this. The Google blob storage equivalent is Google Cloud Storage, which is more analogous to S3.

Honestly, I think the best solution is to use S3 with SSD storage and then sync to S3 or a vector search database.

We don't really use AI embeddings, so a vector search backend isn't much of an issue, but if we're not using fielded search then we don't gain many of the benefits of Elasticsearch.

ES can store documents too and doesn't need to be just a raw search index, but S3 is much faster for key lookups than ES and should be more cost-effective.

timolegros commented 3 months ago

After a lengthy discussion with the platform team, here is what we concluded.

We agreed that our top priority is to ensure we can scale and that the app can remain stable under load similar to what we experienced with LayerZero, once contests are deployed.

We have eliminated Firebase as a possible option, since its performance is out of our control and Kevin has experienced difficulties with it in the past.

Note: S3 and R2 are analogous. We likely want to use R2 rather than S3, since R2 has no egress fees and using S3 may go against Cloudflare's TOS. That said, we have not deeply investigated R2 yet, so it is not confirmed.

Here is how the platform team would like to proceed:

  1. Temporarily (or permanently - see discussion below) remove full-text search on thread bodies (maintaining search over thread titles). This would significantly reduce database load since, in addition to removing expensive full-text search queries, we can drop the `plaintext` and `_search` columns, thus halving the size of Thread and Comment records. This is a step we can obviously only take if product approves. The idea would be to reintroduce full-text search once we have a better setup with S3 or MongoDB.
  2. Move large thread content to S3/R2 (see the read-path sketch after this list). While we considered MongoDB and some of its strengths (integrated full-text search + queryability), it would take longer to integrate and the performance boosts are not guaranteed. As stated in my comment earlier, S3/R2 is a quick and dirty solution that is sure to increase performance for large threads. This solution is 'dirty' because it requires 2 full roundtrips to fetch threads (fetch the id from Postgres and then fetch the content from S3).
  3. Write load tests for all of the important routes.
  4. [OPTIONAL] Build a quick and dirty MongoDB prototype and run the load tests against it and S3 to compare performance. If performance is comparable, we may choose to transition to MongoDB, since it would be easier to reintroduce full-text search.
  5. [OPTIONAL] Reintroduce full-text search with Elasticsearch (AWS OpenSearch) if we continue using S3, or use the integrated full-text search in MongoDB.
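
To make step 2 concrete, here is a minimal sketch of the read path, assuming a hypothetical `thread-content` bucket and an `R2_ENDPOINT` env var (R2 is S3-compatible, so the AWS SDK works against it):

```typescript
// Hypothetical read path for step 2: Postgres keeps thread metadata plus an
// object key, and the body itself lives in R2/S3.
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({
  region: 'auto',
  endpoint: process.env.R2_ENDPOINT, // assumed env var for the R2 endpoint
});

async function fetchThreadBody(objectKey: string): Promise<string> {
  const res = await s3.send(
    new GetObjectCommand({ Bucket: 'thread-content', Key: objectKey }),
  );
  return res.Body!.transformToString();
}
```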

If product denies our request to temporarily remove full-text search, we will still move content to S3/R2 and work on load tests before circling back to MongoDB and full-text search. Note that in this case, performance during high load may be much worse than if we had removed full-text search.

Additional conclusions: if we decide to stick with S3 after step 4, we should emit ThreadCreated/CommentCreated events and then set up a worker (or merge it into CommonwealthConsumer) to post to Elasticsearch. This is preferable to using a Lambda trigger directly from S3, since content shorter than X (not defined, but could be 2k characters) would not be stored in S3 and would remain in Postgres. This way we update the search index in an event-driven fashion without affecting the latency of our API routes.
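
As a rough illustration only (the handler name and event shape are assumptions, not decisions), the event-driven indexing described above might look like:

```typescript
// Hypothetical consumer-side handler: index new thread content into
// Elasticsearch off the main write path.
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: process.env.ELASTIC_URL! });

interface ThreadCreatedEvent {
  threadId: number;
  title: string;
  body: string; // full body, wherever it is persisted
}

async function onThreadCreated(event: ThreadCreatedEvent): Promise<void> {
  // Index every thread, whether its body lives in Postgres or in S3; a Lambda
  // triggered from S3 would miss short bodies that never leave Postgres.
  await es.index({
    index: 'threads',
    id: String(event.threadId),
    document: { title: event.title, body: event.body },
  });
}
```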

burtonator commented 3 months ago

> This solution is 'dirty' because it requires 2 full roundtrips to fetch threads (fetch the id from Postgres and then fetch the content from S3).

The thing is that this might actually be FASTER than we are now, because Postgres would be smaller and the content from S3/R2 (sounds like the name of a character from Star Wars) would resolve in parallel.

This is the big thing we need, though: when fetching the content from S3/R2, we need to fetch via Promise.all so it happens in parallel.
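
A minimal sketch of that pattern, reusing the hypothetical `fetchThreadBody` helper from the earlier sketch:

```typescript
// Resolve many thread bodies from R2/S3 in parallel rather than sequentially;
// total latency is roughly one roundtrip instead of N.
async function fetchThreadBodies(objectKeys: string[]): Promise<string[]> {
  return Promise.all(objectKeys.map((key) => fetchThreadBody(key)));
}
```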

Before we decide 100% on MongoDB we should consider a few things:

Elasticsearch is "state of the art" for "bag of words" FTS, and Mongo isn't really oriented around FTS, so it won't be able to compete if we need advanced FTS.

We might also want to consider decoupling our FTS from the main write path and writing to it via a crawler. Either that, or we post to a queue and index via that strategy rather than writing to FTS directly in our main write path.

timolegros commented 3 months ago

> The thing is that this might actually be FASTER than we are now, because Postgres would be smaller and the content from S3/R2 (sounds like the name of a character from Star Wars) would resolve in parallel.
>
> This is the big thing we need, though: when fetching the content from S3/R2, we need to fetch via Promise.all so it happens in parallel.

It would be faster; that's the goal of moving to S3/R2. But there is still an upper bound on performance, since making two roundtrips may cost more than the single roundtrip another storage strategy would need.

> Elasticsearch is "state of the art" for "bag of words" FTS, and Mongo isn't really oriented around FTS, so it won't be able to compete if we need advanced FTS.

MongoDB search is built on the same backend as Elasticsearch (Apache Lucene), so I don't think the feature gap is vast.
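
For context, here is a sketch of what Lucene-backed search looks like on the MongoDB side, assuming an Atlas Search index named `thread_body` (the database, collection, and index names are all illustrative):

```typescript
// Hedged sketch: an Atlas Search query over thread bodies, for comparison
// with the Postgres GIN-based FTS we use today.
import { MongoClient } from 'mongodb';

async function searchThreads(uri: string, term: string) {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    return await client
      .db('commonwealth')
      .collection('threads')
      .aggregate([
        { $search: { index: 'thread_body', text: { query: term, path: 'body' } } },
        { $limit: 20 },
        { $project: { title: 1, score: { $meta: 'searchScore' } } },
      ])
      .toArray();
  } finally {
    await client.close();
  }
}
```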

> We might also want to consider decoupling our FTS from the main write path and writing to it via a crawler. Either that, or we post to a queue and index via that strategy rather than writing to FTS directly in our main write path.

This is what I noted in the Additional Conclusions section: we would use our existing event infrastructure (relayer + event emission) to asynchronously update Elasticsearch if that is the route we go down.

burtonator commented 3 months ago

> It would be faster; that's the goal of moving to S3/R2. But there is still an upper bound on performance, since making two roundtrips may cost more than the single roundtrip another storage strategy would need.

I think we should be fine. I've deployed this strategy before and it's a winner. The Datastreamer architecture was a 10TB Cassandra index plus a 1PB Elasticsearch index. We didn't use S3, but we would store data directly within Elasticsearch.

If you have a collection of IDs, we can Promise.all to fetch them in parallel.

Also, if we have standard queries, we can mirror those in Elasticsearch so that there's no use of Postgres.

We have to use fields in ES and then make them not tokenized. For example, we could add COMMUNITY_ID and then order the results so there would be no Postgres involved in the original search.
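
A sketch of such a mapping with the Elasticsearch JS client; in ES terms, "not tokenized" means a `keyword` field (the index and field names are illustrative):

```typescript
// Hypothetical index mapping: body is analyzed for FTS, community_id is a
// keyword (not tokenized) so ES alone can filter and sort on it.
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: process.env.ELASTIC_URL! });

async function createThreadsIndex(): Promise<void> {
  await es.indices.create({
    index: 'threads',
    mappings: {
      properties: {
        title: { type: 'text' },
        body: { type: 'text' },
        community_id: { type: 'keyword' },
        created_at: { type: 'date' },
      },
    },
  });
}
```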

timolegros commented 3 months ago

> Also, if we have standard queries, we can mirror those in Elasticsearch so that there's no use of Postgres.
>
> We have to use fields in ES and then make them not tokenized. For example, we could add COMMUNITY_ID and then order the results so there would be no Postgres involved in the original search.

Yea, this would be required if we move search to ES.

My main worry is adding infrastructure that we would have to maintain just for search, whereas with MongoDB, storage + search is one integration and MongoDB Atlas is likely much lower maintenance (AWS always adds significant complexity). Additionally, with MongoDB we don't need async updates to a separate search integration, which further reduces the load on our event pipeline.

burtonator commented 3 months ago

So you would want to use MongoDB instead of S3/R2 and do both things on one platform?

Note that we can do this with Elasticsearch too. You can store 'fields' in it and keep the full content there. You can also make an index 'indexed only' so that it doesn't retain the original. Indexed-only is about 40% of the original size, and content+index is about 140%.
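
Presumably the "indexed only" mode here means disabling `_source`; a sketch of what that looks like at index creation (the size figures above are burtonator's estimates, not measurements on our data):

```typescript
// Hypothetical 'indexed only' index: with _source disabled, ES keeps the
// inverted index but not the original document (searchable, not retrievable).
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: process.env.ELASTIC_URL! });

async function createIndexedOnlyIndex(): Promise<void> {
  await es.indices.create({
    index: 'threads-indexed-only',
    mappings: {
      _source: { enabled: false }, // drop the stored original to save space
      properties: {
        body: { type: 'text' },
      },
    },
  });
}
```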

One other point: Lucene by itself is definitely very different from Elasticsearch. There is a lot of functionality in Lucene, but ES has built really amazing features on top of it around shards, routing, shard splitting, and so forth.

I built a whole engine to do shard splitting in earlier versions of ES, but now it's part of the platform.

Maybe we won't need those features in MongoDB, but in my experience, once you get a LARGE index you end up needing them, or your costs for FTS just become way, way too high. Like 10x too high. The cloud vendors LOVE charging massive amounts of money when you get a large index.

timolegros commented 3 months ago

> So you would want to use MongoDB instead of S3/R2 and do both things on one platform?

> One other point: Lucene by itself is definitely very different from Elasticsearch. There is a lot of functionality in Lucene, but ES has built really amazing features on top of it around shards, routing, shard splitting, and so forth.

I'm not tied to MongoDB, just pointing out that it does have its advantages, namely reducing engineering load, which is a huge advantage given that we are a small team that needs to ship fast. I won't argue that ES is the best for search, but do we really need the most advanced features when all we currently do in Postgres is simple FTS over a GIN index? See the sketch below for what that amounts to.
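
For reference, an illustrative version of that query shape, assuming the `_search` tsvector column mentioned above and the shared Sequelize instance (the `english` config and import path are assumptions):

```typescript
// Roughly what our current Postgres FTS boils down to: a tsvector match
// against the GIN-indexed _search column, ranked by ts_rank.
import { QueryTypes } from 'sequelize';
import { sequelize } from '../database'; // assumed path to the shared instance

async function searchThreadsInPostgres(term: string) {
  return sequelize.query(
    `SELECT id, title
       FROM "Threads"
      WHERE _search @@ plainto_tsquery('english', :term)
      ORDER BY ts_rank(_search, plainto_tsquery('english', :term)) DESC
      LIMIT 20;`,
    { type: QueryTypes.SELECT, replacements: { term } },
  );
}
```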

> Maybe we won't need those features in MongoDB, but in my experience, once you get a LARGE index you end up needing them, or your costs for FTS just become way, way too high. Like 10x too high. The cloud vendors LOVE charging massive amounts of money when you get a large index.

It's a good point to bring up with product: does the value of FTS to users justify the cost of having scalable and efficient FTS?

burtonator commented 3 months ago

@timolegros thanks... yeah, I think the main thing we should probably be thinking about is whether we need FTS (full-text search) in the first place.

I'd like to argue that the fastest way to implement something is to not implement it in the first place.

If it's just a feature-matrix checklist item then maybe we need to add it, but I don't think that's the best way to think about product feature requests.

This is a good overview of how I personally think about it:

https://blog.superhuman.com/how-superhuman-built-an-engine-to-find-product-market-fit/

I think if we were to rank all of our features, then FTS would be somewhere in the long tail. Like around feature 30-40 but if you factor in the complexity of this it looks really difficult to implement.

(image: ranked feature chart; green head on the left, yellow long tail on the right)

So we're on the same page... it would be in the yellow portion on the right.

The green is where we're going to find product-market fit, and those features are what our customers are going to love us for.

FTS is deceptively complicated to implement and is going to require a lot of time.

If we just move the content out of Postgres and into S3/R2, that's going to be pretty straightforward. PG will just store the pointers into S3, and it will dramatically reduce the in-memory footprint of PG.

Fetching the content from S3/R2 can be done in parallel.

This cutover is very simple and we can get about 90% of the advantages that we want here.

If we screw up and it turns out FTS matters, then we can add it after the fact.

If we do want to implement FTS, we could implement it directly from our sitemap. This would increase the value of our sitemap implementation, but it would also mean we could just use something like Algolia in the short term. Algolia is kind of expensive, but it supports indexing via sitemaps.

Also, one issue with an FTS implementation is re-indexing. At some point we're going to want to re-index our FTS index; there might be an algorithm change or new fields we want to index. If we go with an event-sourcing pattern to index the content we wouldn't have this capability, but indexing via sitemaps allows us to decouple our main write pipeline and make our FTS eventually consistent.

Another point: since FTS is currently implemented (just on PG), we could remove it from the frontend and see if users complain.

I suspect that if we disable it we're not going to see many complaints.

This would yield a lot of cost and time savings here.

Totally happy to be wrong here. FTS is a great feature to have - I just think it's going to be harder than we think.

timolegros commented 2 months ago

Alright, so an update here: search cannot and will not be disabled, even temporarily. Transitioning to S3 will require some changes to the way we index for search in Postgres, as described in this ticket: https://github.com/hicommonwealth/commonwealth/issues/8657.

The plan is to maintain search functionality in Postgres in the short term and evaluate transitioning to Elasticsearch or another dedicated search solution after the S3 migration is complete.