[bug]: neutrino does not activate channels confirmed when network was down until two restarts #8415

starius opened this issue 5 months ago

starius commented 5 months ago

Background

I opened a channel from LND using neutrino, got one confirmation, and then the network was down for some time. The channel got additional confirmations, enough to be fully open. Networking was fixed and LND synchronized neutrino to the latest block, but the channel was still marked as pending. I had to restart the node twice to get the channel to show up in lncli listchannels and become active.

I previously described that issue here: https://github.com/lightningnetwork/lnd/issues/8251#issuecomment-1873611686

Your environment

Steps to reproduce

I forked LND and made an itest reproducing this issue. It is pushed here:

https://github.com/starius/lnd/tree/neutrino-disconnect-when-confirmed

I added a new API method in the Dev RPC called ToggleNetwork which breaks all connections (the Read and Write methods start returning an error). Using this method I simulate a network outage for node Bob right after opening the channel with Alice. Then I recover networking using the same API, expecting Bob to catch up.

make itest backend=neutrino icase='disconnect while channel confirmed'
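
For context, the rough shape of the ToggleNetwork behaviour is something like the sketch below. This is only an illustration of the idea (identifiers are made up, not the exact code from my branch): every net.Conn the node creates is wrapped, and while a shared flag is set, Read and Write fail immediately, which looks like a full network outage to the node.

package itestutil

import (
	"errors"
	"net"
	"sync/atomic"
)

// errSimulatedOutage is returned by Read/Write while the outage is active.
var errSimulatedOutage = errors.New("simulated network outage")

// toggleConn wraps a net.Conn and fails all I/O while *down is true.
type toggleConn struct {
	net.Conn
	down *atomic.Bool // flipped by the ToggleNetwork dev RPC
}

func (c *toggleConn) Read(b []byte) (int, error) {
	if c.down.Load() {
		return 0, errSimulatedOutage
	}
	return c.Conn.Read(b)
}

func (c *toggleConn) Write(b []byte) (int, error) {
	if c.down.Load() {
		return 0, errSimulatedOutage
	}
	return c.Conn.Write(b)
}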

Expected behaviour

I expect the test to pass.

Actual behaviour

=== RUN   TestLightningNetworkDaemon
    harness_setup.go:23: Setting up HarnessTest...
    harness_setup.go:36: Prepare the miner and mine blocks to activate segwit...
    harness_setup.go:44: Connecting the miner with the chain backend...
    harness.go:275: Setting up standby nodes Alice and Bob...
    harness.go:314: Finished the setup, now running tests...
=== RUN   TestLightningNetworkDaemon/tranche00/15-of-140/neutrino/disconnect_while_channel_confirmed
    harness_assertion.go:478: 
                Error Trace:    /home/user/lnd/lntest/harness_assertion.go:478
                                                        /home/user/lnd/lntest/harness.go:1230
                                                        /home/user/lnd/lntest/wait/wait.go:51
                                                        /home/user/lnd/lntest/wait/wait.go:27
                                                        /home/user/.goroot/src/runtime/asm_amd64.s:1650
                Error:          timeout
                Test:           TestLightningNetworkDaemon/tranche00/15-of-140/neutrino/disconnect_while_channel_confirmed
                Messages:       timeout waiting for close channel update sent
    harness.go:1243: 
                Error Trace:    /home/user/lnd/lntest/harness.go:1243
                                                        /home/user/lnd/lntest/harness.go:1271
                                                        /home/user/lnd/itest/lnd_open_channel_test.go:803
                                                        /home/user/lnd/lntest/harness.go:353
                                                        /home/user/lnd/itest/lnd_test.go:136
                Error:          Received unexpected error:
                                method did not return within the timeout
                Test:           TestLightningNetworkDaemon/tranche00/15-of-140/neutrino/disconnect_while_channel_confirmed
                Messages:       retry closing channel failed
    harness.go:406: finished test: disconnect_while_channel_confirmed, start height=440, end height=441, mined blocks=1
    harness.go:412: test failed, skipped cleanup
=== NAME  TestLightningNetworkDaemon
    lnd_test.go:144: Failure time: 2024-01-23 16:55:10.003
    lnd_test.go:152: =========> tests finished for tranche: 0, tested 140 cases, end height: 441
--- FAIL: TestLightningNetworkDaemon (82.32s)
    --- FAIL: TestLightningNetworkDaemon/tranche00/15-of-140/neutrino/disconnect_while_channel_confirmed (72.45s)

The test fails in the majority of cases; occasionally it passes by chance.

In Bob's log I see an infinite loop of debug messages:

2024-01-23 16:55:06.590 [DBG] SRVR: Notifying that peer 03a5b3084fc6486d62509221aa8d810d6ea0ea8d8e9b90b61f4fffed4d040da6bf is online
2024-01-23 16:55:06.601 [DBG] FNDG: Channel(3c0c2ef1c5b4301d33a217f58e923846baced196a706d0217e7feedace37ba8f) with ShortChanID 441:1:0 has opening state channelReadySent
2024-01-23 16:55:06.601 [DBG] SRVR: Notifying that peer 03a5b3084fc6486d62509221aa8d810d6ea0ea8d8e9b90b61f4fffed4d040da6bf is online
2024-01-23 16:55:06.612 [DBG] FNDG: Channel(3c0c2ef1c5b4301d33a217f58e923846baced196a706d0217e7feedace37ba8f) with ShortChanID 441:1:0 has opening state channelReadySent

I changed the test, replacing the network recovery with two restarts of node Bob (this variant is left in comments in my branch). That test passes.
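
The changed part looks roughly like this (approximate harness calls from memory; the real version is what is commented out in my branch):

package itest

import "github.com/lightningnetwork/lnd/lntest"

// Sketch of the passing variant: same setup as the original test, but
// instead of recovering the simulated outage we restart Bob twice.
func testDisconnectThenRestartTwice(ht *lntest.HarnessTest) {
	// ... open the channel from Bob to Alice, mine confirmations and
	// simulate the network outage exactly as in the original test ...

	// Restart Bob twice instead of toggling networking back on.
	ht.RestartNode(ht.Bob)
	ht.RestartNode(ht.Bob)

	// After the second restart Bob finally reports the channel as active.
}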

Chinwendu20 commented 3 months ago

Thanks a lot for the test case, it helped me play with the code. I was going through the issue and it seems the channel is marked as pending on Bob's side but is fully open on Alice's side.

When Bob's node loses its connection, the channel is still in the pending state (awaiting at least one confirmation). When the connection is lost on Bob's side:

When Bob gets the connection back:

Alice's side:

Bob's side:

Alice's side:

Bob's side:

This test case fails with neutrino and works with the other backends because neutrino is the only backend that lnd does not communicate with over a separate connection: neutrino is imported as a library and used inside lnd, so it shares the same connection handling as lnd. If lnd has connection issues, neutrino's connection to the miner is affected as well. I guess that with the other backends, since lnd's connectivity does not affect the backend's own connection to the miner, there is some kind of delay delta that prevents the inconsistent state where the channel is pending on one side but fully open on the other.
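
To illustrate the difference I mean (toy types only, nothing from the actual codebase):

package example

import "net"

// With the embedded neutrino backend, chain sync goes through the same
// connection machinery the node uses for everything else, so a simulated
// outage on that machinery also stalls block and filter fetching.
type embeddedChainBackend struct {
	dial func(addr string) (net.Conn, error) // shared with the rest of lnd
}

// With an external backend (bitcoind/btcd), chain data arrives over a
// separate RPC connection that the simulated outage never touches, so the
// backend keeps seeing new blocks even while lnd's peer connections fail.
type externalChainBackend struct {
	rpcConn net.Conn // independent connection to the chain backend
}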

I am not too familiar with the code, so there might be something I am missing or getting wrong.