Event observer recoverability in event of unclean stacks-node shutdown

obycode commented 2 weeks ago

Problem

During the restart to upgrade naka-4 to rc3, we witnessed this situation:

1. stacks-node processes a block.
2. stacks-node is shut down before it successfully sends the new block event to the event observers (the API, in this case).
3. stacks-node is restarted.
4. Because the last block was processed successfully, the node does not know that the block event never reached the event observers, so it proceeds with the next block.
5. The API observer returns an error when it receives the next block, since it never received that block's parent.
6. stacks-node cannot proceed, since it never receives a successful response for the new block event.

Proposed solution

1. Create a new database to store outstanding events.
2. Before attempting to send an event to observers, record the event in this new database.
3. For each event in the database:
   - Send the event to all observers.
   - Delete the event from the database.
4. Proceed only after all events have been sent successfully.
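
The steps above can be sketched in Rust as a small store-and-forward queue. This is only an illustration under stated assumptions, not the code that was eventually merged: `PendingEvents`, `Observer`, `MockObserver`, and `flush` are hypothetical names, and the in-memory map stands in for the proposed database, which would need to be on disk to survive a restart.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the proposed database of outstanding events.
/// A real node would use durable storage so that rows survive a restart.
#[derive(Default)]
pub struct PendingEvents {
    next_id: u64,
    rows: BTreeMap<u64, String>, // id -> serialized event payload
}

impl PendingEvents {
    /// Record an event before any attempt is made to send it.
    pub fn record(&mut self, payload: &str) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.rows.insert(id, payload.to_string());
        id
    }

    /// Number of events that have not yet been acknowledged by all observers.
    pub fn len(&self) -> usize {
        self.rows.len()
    }
}

/// Hypothetical observer interface: Ok(()) means the observer acknowledged the event.
pub trait Observer {
    fn send(&mut self, payload: &str) -> Result<(), String>;
}

/// Toy observer for illustration: rejects the first `failures_left` sends,
/// as if it were unreachable, then accepts everything.
pub struct MockObserver {
    pub failures_left: u32,
    pub received: Vec<String>,
}

impl Observer for MockObserver {
    fn send(&mut self, payload: &str) -> Result<(), String> {
        if self.failures_left > 0 {
            self.failures_left -= 1;
            return Err("observer unavailable".to_string());
        }
        self.received.push(payload.to_string());
        Ok(())
    }
}

/// Send every outstanding event, oldest first, to all observers.
/// An event is deleted only after every observer has acknowledged it.
/// The first failure aborts the pass, so later events are never sent
/// ahead of an event that is still outstanding.
pub fn flush<O: Observer>(store: &mut PendingEvents, observers: &mut [O]) -> Result<(), String> {
    let ids: Vec<u64> = store.rows.keys().copied().collect();
    for id in ids {
        let payload = store.rows[&id].clone();
        for obs in observers.iter_mut() {
            obs.send(&payload)?;
        }
        store.rows.remove(&id);
    }
    Ok(())
}
```

In this sketch, calling `flush` at startup and again after each `record` would resend a block event that was recorded but not acknowledged before shutdown. Note that delivery is at-least-once: with several observers, one that acknowledged an event before another failed will receive that event again on the retry, so observers would need to tolerate duplicates.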

obycode commented 2 weeks ago

The most obvious place to implement this change is directly in EventObserver::send_payload. This would store duplicate information in the database if a node has multiple observers, but it would reduce the amount of refactoring required and also give us finer-grained information about which observers need events rebroadcast: we would rebroadcast only to observers that did not confirm the event last time, instead of always rebroadcasting to all observers. In the majority of cases a node probably has 0 or 1 observers, so there is likely no real difference in practice.
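
To make the trade-off concrete, here is a minimal Rust sketch of the per-observer variant, with one pending row per (observer, event) pair. It does not reproduce the real `EventObserver::send_payload` signature; `PendingPayloads`, `PendingRow`, `retry`, and the endpoint strings are hypothetical, and the `Vec` stands in for a database table.

```rust
use std::collections::HashSet;

/// One pending row per (observer, event) pair. Hypothetical layout:
/// the payload is duplicated when a node has several observers.
pub struct PendingRow {
    pub endpoint: String,
    pub payload: String,
}

/// In-memory stand-in for the database table of unconfirmed payloads.
#[derive(Default)]
pub struct PendingPayloads {
    rows: Vec<PendingRow>,
}

impl PendingPayloads {
    /// Record a payload for one observer before trying to send it.
    pub fn insert(&mut self, endpoint: &str, payload: &str) {
        self.rows.push(PendingRow {
            endpoint: endpoint.to_string(),
            payload: payload.to_string(),
        });
    }

    /// Payloads still unconfirmed for one observer, oldest first.
    pub fn pending_for(&self, endpoint: &str) -> Vec<&str> {
        self.rows
            .iter()
            .filter(|row| row.endpoint == endpoint)
            .map(|row| row.payload.as_str())
            .collect()
    }

    /// Retry rows in insertion order. `deliver` returns true when the
    /// observer confirmed the payload; confirmed rows are deleted.
    /// After an observer fails once, its remaining rows are skipped for
    /// this pass so it never sees a child block before its parent.
    pub fn retry<F>(&mut self, mut deliver: F)
    where
        F: FnMut(&str, &str) -> bool,
    {
        let mut failed: HashSet<String> = HashSet::new();
        self.rows.retain(|row| {
            if failed.contains(&row.endpoint) {
                return true; // keep: an earlier payload for this observer is still pending
            }
            if deliver(&row.endpoint, &row.payload) {
                false // confirmed: delete the row
            } else {
                failed.insert(row.endpoint.clone());
                true // keep for the next attempt
            }
        });
    }
}
```

With this layout, an observer that already confirmed an event is not sent it again when another observer fails, at the cost of storing each payload once per observer.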

obycode commented 2 weeks ago

This is addressed in #5289.

obycode commented 6 days ago

Merged.

stacks-network / stacks-core, issue #5281