Open starius opened 5 months ago
Thanks a lot for the test case, it helped me play with the code. I went through the issue, and it seems the channel was marked as pending on Bob's side while it was fully open on Alice's side.
When Bob's node loses its connection, the channel is still in the pending state (awaiting at least one confirmation).

When the connection is lost on Bob's side:
- Alice sends a `channel_ready` message to Bob. The send succeeds from Alice's point of view, but Bob never receives it.

When Bob gets his connection back:
- Alice's side: Alice receives a `channel_ready` message from Bob, after having sent her own `channel_ready` message to Bob (which was sent successfully from Alice's point of view). Alice sends a `channel_reestablish` message, then waits for a response from Bob.
- Bob's side: Bob receives the `channel_reestablish` message from Alice and cannot reply because it has no active link with Alice. Bob holds the message and waits until it has an active link with Alice to reply. Bob sends a `channel_ready` message to Alice. At this stage the channel is no longer pending from Bob's point of view, but Bob is still awaiting a `channel_ready` message from Alice to complete the funding workflow and have a fully active channel.
- Alice's side: the `channel_ready` message Alice sent was not received by Bob.
- Bob's side: still waiting for Alice's `channel_ready` message.

There are two ways Alice can receive a `channel_reestablish` message from Bob:
This test case fails with Neutrino and works with other backends because Neutrino is the only backend that lnd does not communicate with through a protocol. Neutrino is imported as a library and used inside lnd, so it shares the same connection object with lnd. If lnd has connection issues, they affect Neutrino's connection with the miner as well. I guess that with other backends, since lnd's connection does not affect the backend's connection with the miner, there is some kind of delay that prevents this unstable state, in which a channel is pending on one side but fully open on the other.
I am not too familiar with the code, so I might be missing something or be wrong.
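The divergence described in this comment can be illustrated with a toy model. This is only a sketch, not lnd code: the `peer` type, the `send` helper, and the state strings are hypothetical names, and the model assumes that a channel counts as fully open only once a node has both sent and received `channel_ready`.

```go
package main

import "fmt"

// peer is a hypothetical, simplified view of one node's funding state.
type peer struct {
	name          string
	sentReady     bool // this node believes it sent channel_ready
	receivedReady bool // this node actually received channel_ready
}

// state reports "fully open" only when both halves of the
// channel_ready exchange have completed from this node's view.
func (p *peer) state() string {
	if p.sentReady && p.receivedReady {
		return "fully open"
	}
	return "awaiting channel_ready"
}

// send marks channel_ready as sent on the sender's side and delivers
// it only if the link is up. The sender cannot tell the difference.
func send(from, to *peer, linkUp bool) {
	from.sentReady = true
	if linkUp {
		to.receivedReady = true
	}
}

func main() {
	alice := &peer{name: "Alice"}
	bob := &peer{name: "Bob"}

	// Bob is offline: Alice's channel_ready is lost in transit.
	send(alice, bob, false)

	// Bob reconnects and sends his own channel_ready, which arrives.
	send(bob, alice, true)

	fmt.Println(alice.name, alice.state()) // Alice fully open
	fmt.Println(bob.name, bob.state())     // Bob awaiting channel_ready
}
```

In this model nothing ever triggers Alice to resend `channel_ready`, which matches the stuck state reported in the issue. Whether lnd behaves exactly this way is an assumption of the sketch.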
Background
I opened a channel from LND using neutrino, got one confirmation, and then the network was down for some time. The channel got additional confirmations, enough to be fully open. Networking was fixed and LND synchronized neutrino to the latest block, but the channel was still marked as pending. I had to restart the node twice to get the channel into `lncli listchannels` and `active`.
I previously described that issue here: https://github.com/lightningnetwork/lnd/issues/8251#issuecomment-1873611686
Your environment
- version of `lnd`: v0.17.3-beta
- operating system (`uname -a` on *Nix): Linux host 6.1.62-1.qubes.fc32.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Nov 14 07:11:33 CET 2023 x86_64 GNU/Linux
- version of `btcd`, `bitcoind`, or other backend: neutrino
Steps to reproduce
I forked LND and wrote an itest that reproduces this issue. I pushed it here:
https://github.com/starius/lnd/tree/neutrino-disconnect-when-confirmed
I added a new API method to the Dev RPC called ToggleNetwork, which breaks all connections (the Read and Write methods start returning errors). Using this method, I simulate a network outage for node Bob right after opening the channel with Alice. Then I restore networking using the same API, expecting Bob to catch up.
Expected behaviour
I expect the test to pass.
Actual behaviour
The test fails in the majority of cases. Sometimes it passes by chance.
In Bob's log I see an infinite loop of debug messages:
I also changed the test to replace the network recovery with two restarts of node Bob (I left this in comments in my branch). That version of the test passes.