[bug]: neutrino does not activate channels confirmed when network was down until two restarts #8415

starius opened this issue 5 months ago

starius commented 5 months ago

Background

I opened a channel from LND using neutrino, got one confirmation, and then the network was down for some time. The channel got additional confirmations, enough to be fully open. Networking was fixed and LND synchronized neutrino to the latest block, but the channel was still marked as pending. I had to restart the node twice to get the channel to show up in lncli listchannels and become active.

I previously described that issue here: https://github.com/lightningnetwork/lnd/issues/8251#issuecomment-1873611686

Your environment

Steps to reproduce

I forked LND and made an itest reproducing this issue. It is pushed here:

https://github.com/starius/lnd/tree/neutrino-disconnect-when-confirmed

I added a new API method in the Dev RPC called ToggleNetwork which breaks all connections (the Read and Write methods start returning an error). Using this method I simulate a network outage for node Bob right after opening the channel with Alice. Then I recover networking using the same API, expecting Bob to catch up.

make itest backend=neutrino icase='disconnect while channel confirmed'
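
For context, the rough shape of the ToggleNetwork behaviour is something like the sketch below. This is only an illustration of the idea (identifiers are made up, not the exact code from my branch): every net.Conn the node creates is wrapped, and while a shared flag is set, Read and Write fail immediately, which looks like a full network outage to the node.

package itestutil

import (
	"errors"
	"net"
	"sync/atomic"
)

// errSimulatedOutage is returned by Read/Write while the outage is active.
var errSimulatedOutage = errors.New("simulated network outage")

// toggleConn wraps a net.Conn and fails all I/O while *down is true.
type toggleConn struct {
	net.Conn
	down *atomic.Bool // flipped by the ToggleNetwork dev RPC
}

func (c *toggleConn) Read(b []byte) (int, error) {
	if c.down.Load() {
		return 0, errSimulatedOutage
	}
	return c.Conn.Read(b)
}

func (c *toggleConn) Write(b []byte) (int, error) {
	if c.down.Load() {
		return 0, errSimulatedOutage
	}
	return c.Conn.Write(b)
}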

Expected behaviour

I expect the test to pass.

Actual behaviour

=== RUN   TestLightningNetworkDaemon
    harness_setup.go:23: Setting up HarnessTest...
    harness_setup.go:36: Prepare the miner and mine blocks to activate segwit...
    harness_setup.go:44: Connecting the miner with the chain backend...
    harness.go:275: Setting up standby nodes Alice and Bob...
    harness.go:314: Finished the setup, now running tests...
=== RUN   TestLightningNetworkDaemon/tranche00/15-of-140/neutrino/disconnect_while_channel_confirmed
    harness_assertion.go:478: 
                Error Trace:    /home/user/lnd/lntest/harness_assertion.go:478
                                                        /home/user/lnd/lntest/harness.go:1230
                                                        /home/user/lnd/lntest/wait/wait.go:51
                                                        /home/user/lnd/lntest/wait/wait.go:27
                                                        /home/user/.goroot/src/runtime/asm_amd64.s:1650
                Error:          timeout
                Test:           TestLightningNetworkDaemon/tranche00/15-of-140/neutrino/disconnect_while_channel_confirmed
                Messages:       timeout waiting for close channel update sent
    harness.go:1243: 
                Error Trace:    /home/user/lnd/lntest/harness.go:1243
                                                        /home/user/lnd/lntest/harness.go:1271
                                                        /home/user/lnd/itest/lnd_open_channel_test.go:803
                                                        /home/user/lnd/lntest/harness.go:353
                                                        /home/user/lnd/itest/lnd_test.go:136
                Error:          Received unexpected error:
                                method did not return within the timeout
                Test:           TestLightningNetworkDaemon/tranche00/15-of-140/neutrino/disconnect_while_channel_confirmed
                Messages:       retry closing channel failed
    harness.go:406: finished test: disconnect_while_channel_confirmed, start height=440, end height=441, mined blocks=1
    harness.go:412: test failed, skipped cleanup
=== NAME  TestLightningNetworkDaemon
    lnd_test.go:144: Failure time: 2024-01-23 16:55:10.003
    lnd_test.go:152: =========> tests finished for tranche: 0, tested 140 cases, end height: 441
--- FAIL: TestLightningNetworkDaemon (82.32s)
    --- FAIL: TestLightningNetworkDaemon/tranche00/15-of-140/neutrino/disconnect_while_channel_confirmed (72.45s)

The test fails in the majority of cases; occasionally it passes by chance.

In Bob's log I see an infinite loop of debug messages:

2024-01-23 16:55:06.590 [DBG] SRVR: Notifying that peer 03a5b3084fc6486d62509221aa8d810d6ea0ea8d8e9b90b61f4fffed4d040da6bf is online
2024-01-23 16:55:06.601 [DBG] FNDG: Channel(3c0c2ef1c5b4301d33a217f58e923846baced196a706d0217e7feedace37ba8f) with ShortChanID 441:1:0 has opening state channelReadySent
2024-01-23 16:55:06.601 [DBG] SRVR: Notifying that peer 03a5b3084fc6486d62509221aa8d810d6ea0ea8d8e9b90b61f4fffed4d040da6bf is online
2024-01-23 16:55:06.612 [DBG] FNDG: Channel(3c0c2ef1c5b4301d33a217f58e923846baced196a706d0217e7feedace37ba8f) with ShortChanID 441:1:0 has opening state channelReadySent

I changed the test, replacing the network recovery with two restarts of node Bob (this variant is left in comments in my branch). That test passes.
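
The changed part looks roughly like this (approximate harness calls from memory; the real version is what is commented out in my branch):

package itest

import "github.com/lightningnetwork/lnd/lntest"

// Sketch of the passing variant: same setup as the original test, but
// instead of recovering the simulated outage we restart Bob twice.
func testDisconnectThenRestartTwice(ht *lntest.HarnessTest) {
	// ... open the channel from Bob to Alice, mine confirmations and
	// simulate the network outage exactly as in the original test ...

	// Restart Bob twice instead of toggling networking back on.
	ht.RestartNode(ht.Bob)
	ht.RestartNode(ht.Bob)

	// After the second restart Bob finally reports the channel as active.
}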

Chinwendu20 commented 3 months ago

Thanks a lot for the test case, it helped me play with the code. I was going through the issue and it seems the channel is marked as pending on Bob's side but is fully open on Alice's side.

When Bob's node loses its connection, the channel is still in the pending state (awaiting at least one confirmation). When the connection is lost on Bob's side:

When Bob gets the connection back:

Alice's side:

Bob's side:

Alice's side:

Bob's side:

This test case fails with neutrino and works with the other backends because neutrino is the only backend that lnd does not communicate with over a separate connection: neutrino is imported as a library and used inside lnd, so it shares the same connection handling as lnd. If lnd has connection issues, neutrino's connection to the miner is affected as well. I guess that with the other backends, since lnd's connectivity does not affect the backend's own connection to the miner, there is some kind of delay delta that prevents the inconsistent state where the channel is pending on one side but fully open on the other.
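
To illustrate the difference I mean (toy types only, nothing from the actual codebase):

package example

import "net"

// With the embedded neutrino backend, chain sync goes through the same
// connection machinery the node uses for everything else, so a simulated
// outage on that machinery also stalls block and filter fetching.
type embeddedChainBackend struct {
	dial func(addr string) (net.Conn, error) // shared with the rest of lnd
}

// With an external backend (bitcoind/btcd), chain data arrives over a
// separate RPC connection that the simulated outage never touches, so the
// backend keeps seeing new blocks even while lnd's peer connections fail.
type externalChainBackend struct {
	rpcConn net.Conn // independent connection to the chain backend
}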

I am not too familiar with the code, so there might be something I am missing or getting wrong.