Request on feedback for the polkadot-api "recovery from `stop` events" strategy

josepot commented 8 months ago

We've been working on improving our logic for recovering from "stop" events and encountered some complexities along the way. However, we've identified a potential optimization in our approach, thanks to a detail in the spec. Before proceeding, I wanted to share my strategy and gather any feedback or insights that I might have overlooked.

Current Approach: Our system maintains a data structure that tracks all currently pinned blocks, plus some extra info like ref-counts, pointers to parent and children, etc. When a "stop" event occurs, our current logic terminates all operations on these blocks and resets the state, which is less than ideal.
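
For readers unfamiliar with this kind of bookkeeping, here is a minimal sketch of such a structure. It is not the actual polkadot-api code: the `PinnedBlock` and `PinnedBlocks` names, their fields, and their methods are all hypothetical and only illustrate the ref-count and parent/children tracking described above, with `reset()` standing in for the current "drop everything on stop" behaviour.

```typescript
// Hypothetical shape of one tracked block (all names are illustrative).
interface PinnedBlock {
  hash: string
  number: number
  parent: string | null
  children: Set<string>
  refCount: number // number of in-flight operations using this block
}

class PinnedBlocks {
  private readonly blocks = new Map<string, PinnedBlock>()

  // Register a newly pinned block and link it to its parent, if tracked.
  add(hash: string, number: number, parent: string | null): void {
    this.blocks.set(hash, {
      hash,
      number,
      parent,
      children: new Set<string>(),
      refCount: 0,
    })
    if (parent !== null) this.blocks.get(parent)?.children.add(hash)
  }

  get(hash: string): PinnedBlock | undefined {
    return this.blocks.get(hash)
  }

  // An operation starts using the block.
  hold(hash: string): void {
    const block = this.blocks.get(hash)
    if (!block) throw new Error(`Block ${hash} is not pinned`)
    block.refCount++
  }

  // An operation is done; returns true when nothing uses the block anymore.
  release(hash: string): boolean {
    const block = this.blocks.get(hash)
    if (!block) return false
    block.refCount = Math.max(0, block.refCount - 1)
    return block.refCount === 0
  }

  // Current behaviour on "stop": forget everything and report what was dropped.
  reset(): string[] {
    const dropped = Array.from(this.blocks.keys())
    this.blocks.clear()
    return dropped
  }
}
```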

Proposed Strategy: Instead of terminating operations upon receiving a "stop" event, the new approach temporarily pauses them. Each paused operation is then either resumed or errored, depending on the new set of pinned blocks.

Here's how it would work:

1. Upon receiving a "stop" event, we initiate a new chainHead_follow request (with `withRuntime` set to true) to start forming a new set of pinned blocks.
2. Any pending operation whose block number is smaller than that of the oldest block in the list provided by the new "initialized" event is immediately errored (and its block is removed from the data structure).
3. Operations that belong to blocks that are now pinned are progressively resumed.
4. The crucial moment comes with the first "best-block" event from the new subscription, which reveals the exact overlap between the old and new sets of pinned blocks. At that point we can immediately error all operations related to blocks not in the new set.

This approach hinges on the spec's provision that the first "best-block" event effectively communicates the new set of relevant blocks, allowing us to make informed decisions on the fly.
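
The steps above can be sketched roughly as two pure classification functions. This is only an illustration under stated assumptions, not our actual implementation: `PendingOp`, `BlockRef`, `classifyOnInitialized` and `classifyOnFirstBestBlock` are hypothetical names, and the sketch assumes that the block numbers of the new finalized blocks are already known and that the list is non-empty.

```typescript
type Decision = "resume" | "error" | "wait"

// Hypothetical description of a paused operation.
interface PendingOp {
  id: string
  blockHash: string
  blockNumber: number
}

// Assumption: the block number of each new finalized block is already known.
interface BlockRef {
  hash: string
  number: number
}

// After the new "initialized" event: error operations older than the oldest
// block in the new list, resume those whose block is pinned again, and keep
// the rest paused until the first "best-block" event.
function classifyOnInitialized(
  ops: PendingOp[],
  finalized: BlockRef[],
): Map<string, Decision> {
  const pinned = new Set(finalized.map((b) => b.hash))
  const oldest = Math.min(...finalized.map((b) => b.number))
  const decisions = new Map<string, Decision>()
  for (const op of ops) {
    if (op.blockNumber < oldest) decisions.set(op.id, "error")
    else if (pinned.has(op.blockHash)) decisions.set(op.id, "resume")
    else decisions.set(op.id, "wait")
  }
  return decisions
}

// After the first "best-block" event: the new set of pinned blocks is known,
// so every operation that is still waiting can be settled.
function classifyOnFirstBestBlock(
  waiting: PendingOp[],
  newPinned: ReadonlySet<string>,
): Map<string, Decision> {
  const decisions = new Map<string, Decision>()
  for (const op of waiting) {
    decisions.set(op.id, newPinned.has(op.blockHash) ? "resume" : "error")
  }
  return decisions
}
```

Keeping the two phases as separate functions mirrors the two points in time at which decisions can be made; the edge case of a second "stop" arriving between them is deliberately left out here.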

Our actual logic is a tiny bit more complex, because we also deal with the fact that a "stop" event could arrive before we have received the "best-block" event of the new subscription. However, leaving that edge case aside, we would like to ask whether this approach makes sense, and/or whether we should rethink it.

Thanks!

tomaka commented 8 months ago

I would personally keep the current approach. The stop event is never supposed to happen under normal operations.

josepot commented 8 months ago

> I would personally keep the current approach. The stop event is never supposed to happen under normal operations.

Then I guess that we will be opening issues on smoldot, because it currently happens under "normal" operations.

Nevertheless, I can't help but wonder: then, what was the point of adding a list of finalizedBlocks into the initialized event? Wasn't the point of that to have the ability to recover from the stop event?

tomaka commented 8 months ago

> because it currently happens under "normal" operations.

Smoldot will generate a stop event under normal circumstances when it syncs more than 32 (IIRC) blocks at once. In that case, you wouldn't be able to recover from that anyway.

> then, what was the point of adding a list of finalizedBlocks into the initialized event? Wasn't the point of that to have the ability to recover from the stop event?

You can recover from a stop event if you really want to, but to me it's not worth doing that, especially when it comes to non-finalized blocks.

paritytech / json-rpc-interface-spec

Request on feedback for the polkadot-api "recovery from `stop` events" strategy #145