Open TheBlueMatt opened 6 months ago
Check that we re-claim payments from other MPP channels if we restart and only managed to claim on one channel, specifically:
This is, somewhat surprisingly, not currently broken, but it's quite subtle and I kinda want to block related channels here anyway. Right now, when we do a claim, we hold the `total_consistency_lock` read lock for the duration of the MPP claim (across all channels); this is quite important and I don't see this changing. Further, on startup, if we see one monitor with a preimage we'll replay it on all other monitors as long as the pending payment is still there. Thus there are really only two cases here: either we claim the payment via some `ChannelMonitor`'s preimage, in which case the `ChannelManager` replays the whole thing on startup (and the channel that did get updated force-closes), or the `ChannelManager` has completed a persistence after the claim, at which point it has all the other `ChannelMonitor`s queued up and will replay them through the normal async pipeline.
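The startup replay described above can be sketched roughly as follows. This is a simplified illustration, not LDK's actual types or API: `Monitor`, `replay_claims_on_startup`, and the integer payment-hash keys are all hypothetical stand-ins. The point is just that finding the preimage in any one monitor of an MPP set is enough to replay the claim against every monitor in that set.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a ChannelMonitor: just the preimages it has
// learned, keyed by a (simplified) payment hash.
struct Monitor {
    preimages: HashMap<u64, [u8; 32]>,
}

// On startup, if any monitor in an MPP set already holds the preimage for a
// still-pending payment, replay the claim against every monitor in that set.
// Returns, per monitor index, the (payment hash, preimage) claims to replay.
fn replay_claims_on_startup(
    monitors: &[Monitor],
    pending_mpp_parts: &HashMap<u64, Vec<usize>>, // payment hash -> monitor indices
) -> HashMap<usize, Vec<(u64, [u8; 32])>> {
    let mut replays: HashMap<usize, Vec<(u64, [u8; 32])>> = HashMap::new();
    for (hash, part_monitors) in pending_mpp_parts {
        // Any single monitor that learned the preimage suffices...
        if let Some(preimage) = monitors.iter().find_map(|m| m.preimages.get(hash)) {
            // ...to replay the claim on every channel in the MPP set.
            for &idx in part_monitors {
                replays.entry(idx).or_default().push((*hash, *preimage));
            }
        }
    }
    replays
}
```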
This relies on another important property that is critical to our payment receipt flow working, but is not documented and which we really shouldn't rely on: the `ChannelManager` is always persisted between receiving a payment and claiming it. This is a side-effect of the way our `lightning-background-processor` works, specifically that we always interleave event processing and `ChannelManager` persistence. (@G8XSU we should persist the `ChannelManager::claimable_payments` info in `ChannelMonitor`s with the HTLCs so that we can recreate the `PaymentClaimed` event for users even if the `ChannelManager` never gets persisted between first receiving the payment HTLC and claiming it.)
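The `lightning-background-processor` property being relied on amounts to a loop shaped like the one below. This is a schematic sketch, not the processor's real code, and both function parameters are made-up names for illustration: the invariant is only that a `ChannelManager` persist always follows each round of event processing before the next round begins.

```rust
// Schematic of the background-processor invariant relied on above: event
// processing (where payment receipt and claims surface) is always followed
// by a ChannelManager persist before the next batch of events is handled.
fn run_background_iterations(
    iterations: usize,
    mut process_pending_events: impl FnMut(),
    mut persist_manager: impl FnMut(),
) {
    for _ in 0..iterations {
        // Handle any pending events (e.g. a payment is received and claimed).
        process_pending_events();
        // Persist the manager so the post-event state survives a restart.
        persist_manager();
    }
}
```

Because the persist is unconditional in each iteration, a claim triggered during event processing can never be the first thing on disk: a manager snapshot that knows about the receivable payment was always written first.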
If we assume we make that change, the MPP-with-async-persist flow would become unsafe: we could have one channel complete its full claim persist and `commitment_signed` dance, then restart, and we wouldn't be able to reclaim the payment because the preimage is missing. Thus, I'd like to go ahead and land an `RAAMonitorUpdateBlockingAction` for MPP claims, but not store it to disk, as we can always regenerate it when we replay the claim.
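The in-memory-only blocking action could look, very roughly, like this. Everything here is a hypothetical simplification (the real `RAAMonitorUpdateBlockingAction` lives inside LDK's channel state machine, with real channel and update types): the point is only that the block is held in memory, never written to disk, and re-created whenever the claim is replayed, so it needs no persistence of its own.

```rust
use std::collections::HashMap;

// Hypothetical, simplified tracker: while an MPP claim is in flight, each
// involved channel holds an in-memory block on RAA-driven monitor updates.
// Nothing here is persisted; a restart re-creates the blocks when the claim
// is replayed, which is why they never need to be written to disk.
#[derive(Default)]
struct MppClaimBlocks {
    // channel id -> number of outstanding MPP claim parts blocking it
    blocked: HashMap<u32, usize>,
}

impl MppClaimBlocks {
    // Called when the claim begins (or is regenerated on a startup replay).
    fn block_channels(&mut self, channels: &[u32]) {
        for &c in channels {
            *self.blocked.entry(c).or_insert(0) += 1;
        }
    }

    // Called as each channel finishes its claim persist + commitment dance.
    fn part_completed(&mut self, channel: u32) {
        if let Some(n) = self.blocked.get_mut(&channel) {
            *n -= 1;
            if *n == 0 {
                self.blocked.remove(&channel);
            }
        }
    }

    fn is_blocked(&self, channel: u32) -> bool {
        self.blocked.contains_key(&channel)
    }
}
```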
While we "shipped" async persistence quite a few releases ago, we've always maintained that it is an alpha feature with some known (and probably unknown) bugs. This is a tracking issue for both: things I think we should audit, as well as things I'm confident are broken, or at least took a note of at some point because I was confident they were broken at the time.
Things that are probably broken
- `2023-11-mon-claim-bug` branch - #3120
- When a `ChannelMonitor` learns a preimage for an outbound HTLC and needs to provide the preimage back to the inbound edge of the forwarded HTLC, the `ChannelMonitor` cannot be pruned until the inbound edge `ChannelMonitor` is persisted. See https://github.com/lightningdevkit/rust-lightning/pull/2167#discussion_r1190243415

Things we should audit carefully
Finally, to actually "release" it, we should: