routing: delay pruning closed channels from the channel graph by 12 blocks

Roasbeef commented 1 year ago

Spec Change & Potential Implications

A bit over a year ago a changed was merged into the spec to advise nodes to wait 12 blocks before removing spent funding outputs. This has implications for splciign, as if nodes just wait before removing a channel, then from the PoV of path finders, it's as if the channel never closed. In this case, the only thing lost then is channel age, which some (?) algos use as a proxy for reliability. We should catch up to observe this while also thinking about path finding and re-org implications.

On the path finding side, if we see a channel X with 1 BTC, which is then "closed", and replaced with a channel with 1.1 BTC, for the 12 block interval, would we then see a 2.1 BTC link between the two parties, or a 1.1 BTC link? All our path finding algos now use the capacity to factor into the apriori probability, so done in an iterative manner, we'd think the channel is getting large and larger, and try to send payments through that would likely fail. Does this matter much in practice?

Re re-orgs, it's possible that a channel is closed, a few blocks elapse, then the close itself is re-org'd out. In this case, we haven't yet removed the channel yet as it's still valid, but would the nodes continue to observe the channel as if it was never closed?

Architecture Overview

The FilteredChainView is used today to handle removing spent channels for a given block. Upon start up we'll scan forward in the chain to find all closed channels since we were down. This logic is also re-org aware, as it's able to detect we were on a stale chain to ensure we consume the proper blocks. Once caught up, for each new block, we'll prune the closed channels from the channel graph.

Suggested Implementation Path

Given the above, it may make the most sense to actually implement this logic on the filtered block level. So rather than notify for the chain tip, it first gives a notification for 12 blocks back, then has a lagging pointer from there on.

The catch up logic would then just (?) lag behind by 12 blocks, reducing the bounds of this loop: https://github.com/lightningnetwork/lnd/blob/1871970543c92d5c833205fa5f13b426c6b88132/routing/router.go#L803-L815

This change would also affect most of our itests as well, since they always assert that the channels aren't present in the graph right after the close tx is mined. We can either add a build tag to ignore this for itests, or have every test mine 12 blocks at the end, then assert that all the channels are gone.

saubyk commented 1 year ago

On the path finding side, if we see a channel X with 1 BTC, which is then "closed", and replaced with a channel with 1.1 BTC, for the 12 block interval, would we then see a 2.1 BTC link between the two parties, or a 1.1 BTC link? All our path finding algos now use the capacity to factor into the apriori probability, so done in an iterative manner, we'd think the channel is getting large and larger, and try to send payments through that would likely fail. Does this matter much in practice?

Wouldn't this mean that nodes with spliced channels would experience a dip in their routing traffic for a while?

Crypt-iQ commented 1 year ago

Is 12 blocks enough in high fee environments?

Roasbeef commented 1 year ago

Is 12 blocks enough in high fee environments?

I don't think the fee confirmation delay matters, what does matter is the amount of time that elapses between the splice transaction getting confirmed and a sender seeing the new channel announcement. While the splice transaction is unconfirmed, the sender still sees the channel as being open, once it's closed though, then as is they'll remove it from the graph.

The key assumption here is that "everyone the matters" will see the new channel announcement within 12 blocks of the splice transaction (seen as a close to the uninitiated) confirms.

Roasbeef commented 1 year ago

Thinking about it a bit more, I think the concerns re path finding only apply for the 12 block interval that an observant client detects both channels as being usable. If the client recognizes that channel 2 actually spends channel 1, then they can update their review and use the capacity for channel 2.

lightningnetwork / lnd