SpvClient loses the connection chain and stops providing blocks

mariocynicys commented 1 year ago

This is a hard-to-reproduce issue, but basically bitcoind prunes old blocks that haven't yet been delivered to the tower, so the tower stops connecting (watching) blocks.

Repro:

Keep your node shut down for a couple of days or weeks (yeah, really)
Start the node and the tower together (so the tower doesn't lag behind the prune length)

What will happen:

The tower will get past the 100-block check when bootstrapping, since these blocks aren't pruned yet
bitcoind will become (very) slow to respond to RPC calls after some time, probably because it's busy validating blocks
The tower will hang waiting for RPC responses from bitcoind, while bitcoind moves so quickly that it prunes blocks the tower hasn't received yet.

At this point spv_client.poll_best_tip will stop connecting blocks (blocks are connected sequentially, if one is missing we can't connect later ones), which is indicated by the boolean returned.

    /// Polls for the best tip and updates the chain listener with any connected or disconnected
    /// blocks accordingly.
    ///
    /// Returns the best polled chain tip relative to the previous best known tip and whether any
    /// blocks were indeed connected or disconnected.
    pub async fn poll_best_tip(&mut self) -> BlockSourceResult<(ChainTip, bool)> { ... }

The tower will not get any blocks after this point nor will it report errors.
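A rough sketch of how the caller could notice this silent stall, assuming it shows up as a better tip reported together with a false boolean. ChainTip here is a simplified stand-in for the real type (heights instead of validated headers), and StallDetector and PollOutcome are hypothetical names, not part of rust-teos or lightning-block-sync:

```rust
/// Simplified stand-in for the `ChainTip` returned by `poll_best_tip`.
/// Assumption: the real variants carry a validated header; a height is
/// enough for this illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ChainTip {
    Common,
    Better(u32),
    Worse(u32),
}

/// What a single poll tells us (hypothetical type).
#[derive(Debug, PartialEq)]
enum PollOutcome {
    /// Nothing new to connect.
    Idle,
    /// Blocks were connected or disconnected.
    Progress,
    /// A better tip exists but no block could be connected.
    Stalled { consecutive: u32 },
}

/// Counts consecutive polls that saw a better tip without connecting anything
/// (hypothetical helper).
struct StallDetector {
    consecutive: u32,
}

impl StallDetector {
    fn new() -> Self {
        Self { consecutive: 0 }
    }

    /// Feed in the `(ChainTip, bool)` pair returned by one poll.
    fn observe(&mut self, tip: ChainTip, blocks_changed: bool) -> PollOutcome {
        match (tip, blocks_changed) {
            // Any connected or disconnected block resets the counter.
            (_, true) => {
                self.consecutive = 0;
                PollOutcome::Progress
            }
            // The source has a better tip, yet we could not move towards it.
            (ChainTip::Better(_), false) => {
                self.consecutive += 1;
                PollOutcome::Stalled { consecutive: self.consecutive }
            }
            // Same tip, or a worse one: nothing to do.
            (ChainTip::Common, false) | (ChainTip::Worse(_), false) => PollOutcome::Idle,
        }
    }
}
```

The polling loop could then log an error or trigger recovery once consecutive passes some threshold, instead of repeating the same poll forever.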

Such an issue could be triggered by losing the internet connection for a long time, so it might be worth resolving it automatically rather than requiring manual intervention.

mariocynicys commented 1 year ago

It's all "Updating best tip" logs after this. Note: This tower isn't the master one.

anipaul2 commented 1 year ago

Are there any existing error-handling mechanisms in place within the SpvClient or related components to handle scenarios where blocks are not delivered due to pruning?

mariocynicys commented 1 year ago

> Are there any existing error-handling mechanisms in place within the SpvClient or related components to handle scenarios where blocks are not delivered due to pruning?

They don't treat it as an error per se, but they do report it back to the caller through the boolean in BlockSourceResult<(ChainTip, bool)>. So we should be able to recover from or react to that.

anipaul2 commented 1 year ago

If the boolean value indicates that blocks were disconnected, can we retry fetching and connecting the missing blocks?

mariocynicys commented 1 year ago

> If the boolean value indicates that blocks were disconnected, can we retry fetching and connecting the missing blocks?

The expected action here is blocks getting connected or disconnected. If either of these happens, the boolean should be true. If the best tip is fetched without any blocks being connected or disconnected, that's the bad case. Retrying will probably do nothing, since the blocks are already pruned. We can either report the issue to the user or automatically move the spv client's tip forward to a non-pruned block and risk not connecting all the blocks in between.
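To make the two options concrete, here is a minimal sketch of the decision, assuming we know the height of the last block we connected and the lowest block height bitcoind still stores (my understanding is that getblockchaininfo exposes this as pruneheight on pruned nodes). plan_recovery, Recovery and allow_skip are hypothetical names for illustration only:

```rust
/// Possible reactions to a stalled tip (hypothetical type).
#[derive(Debug, PartialEq)]
enum Recovery {
    /// The next block we need is still stored, so retrying can succeed.
    Retry,
    /// The gap is pruned: resume from the lowest stored block and accept
    /// that `skipped` blocks are never connected.
    SkipForward { resume_from: u32, skipped: u32 },
    /// The gap is pruned and skipping is not allowed: tell the user which
    /// heights are missing.
    ReportToUser { first_missing: u32, last_missing: u32 },
}

/// `last_connected`: height of the last block delivered to the tower.
/// `prune_height`: lowest height the backend still stores, `None` if the
/// backend is not pruned (assumption: taken from `getblockchaininfo`).
fn plan_recovery(last_connected: u32, prune_height: Option<u32>, allow_skip: bool) -> Recovery {
    let next_needed = last_connected + 1;
    match prune_height {
        // The block we need next sits below the lowest stored block.
        Some(lowest) if next_needed < lowest => {
            if allow_skip {
                Recovery::SkipForward {
                    resume_from: lowest,
                    skipped: lowest - next_needed,
                }
            } else {
                Recovery::ReportToUser {
                    first_missing: next_needed,
                    last_missing: lowest - 1,
                }
            }
        }
        // Unpruned backend, or the needed block is still available.
        _ => Recovery::Retry,
    }
}
```

Skipping forward keeps the tower running but means any appointment triggered inside the skipped range goes unnoticed, which is why it is gated behind an explicit flag in this sketch.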

sr-gi commented 1 year ago

We may be able to fix this by checking whether bitcoind is in initial block download (IBD). By default, bitcoind reports being in IBD if the node has just started and its chain tip is more than 24h old (the backend tip stalled for more than a day). This is only checked on bootstrap: once the flag latches to false it will not change back to true while running, even if all peers disconnect from us and we don't get any data for longer than a day. This should not be an issue for us, though.

We could either refuse to run in that case or wait until the backend catches up. The IBD flag is reported by getblockchaininfo, which we already call when starting the tower to check which chain we're running on. We may need to update the wrapper to return both the chain and whether we are in IBD. Furthermore, we could add special handling for regtest or similar, given this may not be as relevant there and may trigger more often than not.
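Roughly, the startup check could look like the sketch below. It assumes the wrapper has already pulled the chain and initialblockdownload fields out of the getblockchaininfo response; BlockchainInfo, StartupAction, startup_action and the wait_for_sync flag are hypothetical names and not what the PoC necessarily uses:

```rust
/// The two `getblockchaininfo` fields this check needs (hypothetical struct).
#[derive(Debug, Clone)]
struct BlockchainInfo {
    /// Value of the `chain` field, e.g. "main", "test" or "regtest".
    chain: String,
    /// Value of the `initialblockdownload` field.
    initial_block_download: bool,
}

/// What the tower should do at startup (hypothetical type).
#[derive(Debug, PartialEq)]
enum StartupAction {
    /// Backend is synced (or the check does not apply): start normally.
    Proceed,
    /// Keep polling the backend until it leaves IBD, then start.
    WaitForSync,
    /// Refuse to start while the backend is in IBD.
    Abort,
}

fn startup_action(info: &BlockchainInfo, wait_for_sync: bool) -> StartupAction {
    // Assumption: regtest chains are often idle for more than a day, so the
    // IBD flag is not meaningful there and is ignored.
    if !info.initial_block_download || info.chain == "regtest" {
        return StartupAction::Proceed;
    }
    if wait_for_sync {
        StartupAction::WaitForSync
    } else {
        StartupAction::Abort
    }
}
```

Whether to wait or abort is left as a flag here because both options are on the table; aborting is simpler, while waiting avoids a manual restart after a long outage.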

sr-gi commented 1 year ago

Here's a PoC for this: https://github.com/sr-gi/rust-teos/tree/ibd-abort. @mariocynicys if you still have a copy of the chain that was triggering this error, would you mind testing it out (assuming you're ok with the approach)?

talaia-labs / rust-teos

SpvClient loses the connection chain and stops providing blocks #223