hicommonwealth / commonwealth

A platform for decentralized communities
https://commonwealth.im
GNU General Public License v3.0

KILL CHAIN-EVENTS SERVICE :gun: #8343

Open timolegros opened 3 months ago

timolegros commented 3 months ago

Description

Since we are killing governance events as part of #8325, there will be no events left on non-Alchemy chains that we need to listen to, so we can limit the scope of EVM chains to just Alchemy-supported EVM chains. This means we can replace chain-events with a far more scalable and reliable chain-event processing system built on Alchemy Webhooks or WebSockets. This would be a massive win: it cuts development time significantly, since we would no longer maintain the chain-events infrastructure or its complex integration tests.

In my opinion, Webhooks would be much simpler to implement and maintain, since we can rely on Heroku's load-balancing and routing functionality to ensure each Webhook request is routed to a single dyno and processed only once. This means that as the API scales, so does its ability to handle Alchemy Webhook requests. With Websockets we would still need to run a single dyno that maintains the Websocket connection. Additionally, retries with Websockets may not be as robust as Webhook retries (still researching this; see below). Both methods would bring significant savings in both Alchemy compute unit usage and developer time.
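Because failed Webhook deliveries are retried, the same event could reach the API more than once. A minimal sketch of an idempotent handler follows. The `WebhookDeduper` class and the assumption that each payload carries a unique `id` are illustrative and not taken from the Commonwealth codebase. Across multiple dynos the in-memory `Set` would need to be a shared store, for example a Postgres table with a unique key.

```typescript
// Hypothetical subset of a webhook payload; a unique per-event `id` is an assumption.
type WebhookPayload = { id: string; event: unknown };

class WebhookDeduper {
  // In-memory stand-in for a shared store such as a unique-keyed DB table.
  private seen = new Set<string>();

  handle(
    payload: WebhookPayload,
    processEvent: (payload: WebhookPayload) => void,
  ): "processed" | "duplicate" {
    if (this.seen.has(payload.id)) return "duplicate";
    // If processing throws, the id is not recorded, so a later retry is re-attempted.
    processEvent(payload);
    this.seen.add(payload.id);
    return "processed";
  }
}
```

A duplicate delivery can then be acknowledged with a success status without repeating any side effects.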

Webhooks

Pros

Cons

Rev 3: (NOT A CON ANYMORE) Alchemy recently released an API for CRUD operations on Custom GraphQL Webhook queries. We can use it to programmatically update existing webhooks (or create new ones) so that they return only the events we want to parse from each block. This avoids having to fetch all events from each block, further reducing costs.
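To illustrate the kind of query this would generate, here is a rough sketch of a helper that builds a GraphQL query restricted to logs from a list of contract addresses. The `buildLogFilterQuery` helper is hypothetical, and the field names (`block`, `logs`, `filter`, `addresses`) reflect my understanding of the Custom Webhook schema, so they should be checked against the current Alchemy docs before use.

```typescript
// Hypothetical helper: builds a query that returns only logs emitted by the
// given contract addresses, instead of every log in the block.
// Field names are assumed and should be verified against Alchemy's schema.
function buildLogFilterQuery(addresses: string[]): string {
  const quoted = addresses.map((a) => JSON.stringify(a)).join(", ");
  return `{
  block {
    hash
    number
    logs(filter: { addresses: [${quoted}] }) {
      topics
      data
      account { address }
      transaction { hash }
    }
  }
}`;
}
```

The resulting string would be sent as the query body when creating or updating a webhook, for instance whenever a new contest contract address needs to be tracked.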

Websockets (basically just pros/cons flipped)

Pros

Cons

Note that we have many correlated events that must be processed in order. To handle cases where events are received out of order (e.g. a Webhook failure during a server restart or downtime), we would need a strategy to delay the processing of specific events until the origin/required event has been processed. We can achieve this with a simple Postgres table queue. For example, suppose we had an OutOfOrderEvents table. Consider the following scenario:

  1. A user creates a contest on-chain
  2. Commonwealth API goes down immediately after
  3. Alchemy Webhook fails with a 500 error since the Commonwealth API is down.
  4. API comes back online
  5. User creates a thread on a contest topic -> ThreadCreated event is emitted (@rbennettcw is this even possible if the contest projects have not yet been updated?)
  6. While processing the new thread event, the contest is not found in the DB
  7. Insert ThreadEvent into the DB with some data about the associated contest and dequeue the event
  8. Webhook retry request for the new contest event from Alchemy comes in a couple of minutes later
  9. While processing the new contest event, check the table for correlated events that may be dependent on the new contest event
  10. Finish processing the contest event and then requeue the dependent events.

The strategy above is simple, scalable, and easy to maintain and could be used to ensure any correlated events are always processed in order no matter when they are emitted by Alchemy. This strategy avoids having to drop events (if out of order) or migrate events (after server downtime). Note that technically, out-of-order events between chain events and forum events are currently possible so this may be a solution we want to implement instead of infinitely retrying/requeuing events until an origin event is found/processed.
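The scenario above can be sketched roughly as follows. This is a minimal in-memory illustration, not the actual implementation: the `ChainEvent` shape, the `EventProcessor` class, and the `Map` standing in for the OutOfOrderEvents Postgres table are all hypothetical.

```typescript
// Hypothetical event shape; field names are illustrative only.
type ChainEvent = {
  id: string;
  kind: "ContestCreated" | "ThreadCreated";
  contestId: string;
};

// In-memory stand-in for the OutOfOrderEvents table, keyed by the
// contest that a parked event depends on.
class OutOfOrderEvents {
  private parked = new Map<string, ChainEvent[]>();

  park(dependsOn: string, event: ChainEvent): void {
    const list = this.parked.get(dependsOn) ?? [];
    list.push(event);
    this.parked.set(dependsOn, list);
  }

  // Removes and returns every event that was waiting on `dependsOn`.
  release(dependsOn: string): ChainEvent[] {
    const list = this.parked.get(dependsOn) ?? [];
    this.parked.delete(dependsOn);
    return list;
  }
}

class EventProcessor {
  readonly contests = new Set<string>();
  readonly processed: string[] = [];
  private table = new OutOfOrderEvents();

  handle(event: ChainEvent): void {
    if (event.kind === "ContestCreated") {
      this.contests.add(event.contestId);
      this.processed.push(event.id);
      // Steps 9-10: requeue events that were waiting on this contest.
      for (const dependent of this.table.release(event.contestId)) {
        this.handle(dependent);
      }
      return;
    }
    // Steps 6-7: the contest is not in the DB yet, so park the thread event.
    if (!this.contests.has(event.contestId)) {
      this.table.park(event.contestId, event);
      return;
    }
    this.processed.push(event.id);
  }
}

// The thread event arrives before the retried contest event.
const processor = new EventProcessor();
processor.handle({ id: "thread-1", kind: "ThreadCreated", contestId: "c1" });
processor.handle({ id: "contest-1", kind: "ContestCreated", contestId: "c1" });
// processor.processed is now ["contest-1", "thread-1"]
```

In a real deployment the park and release steps would be inserts and deletes on the Postgres table, ideally inside the same transaction as the contest insert so that a dependent event cannot be parked just after the release runs.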

Transitioning to Webhooks (TBD)

  1. Refactor contest and chain-event policies into a single tRPC command.
  2. Remove the chain-events service.
  3. Remove chain-event-related RabbitMQ configuration.

Additional context

Retry strategy for Webhooks: (image not reproduced here)

Backfilling strategy for Websockets: WIP (researching)

Tagging @jnaviask @Rotorsoft @kurtisassad @ForestMars @rbennettcw @ianrowan for visibility/feedback.

jnaviask commented 2 months ago

@timolegros we're good to proceed = no plans for Blast at this time.

timolegros commented 2 months ago

Part 2 of this saga is up with #8732. This implements all the event parsing for the webhook endpoint. Part 3 will:

At this time we will not need any special event ordering mechanism, as the automatic retries of event processing are sufficient. Part 3 will transition us from CE v2 to CE v3, and part 4 will remove CE v2 and its associated code.

timolegros commented 2 months ago

Update: chain re-orgs will be handled as part of https://github.com/hicommonwealth/commonwealth/issues/8764