Open AndySchroder opened 2 months ago
Also, I have another node C. This node is on the same physical machine as node B. Node A and C can communicate together. Wondering if node A is getting confused between node B and C since they have the same IP address? Node C has port 9735 opened on the firewall.
Also, I have another node D. This is on the same physical machine as A. Node D can stay connected to node A.
Both node B and D have nolisten=true
set in lnd.conf
.
Hi @AndySchroder,
Could you please check the following just for some initial clarifications:
Does the connection remain up if you remove:
nolisten=true
On node B
?
Alternatively does the connection remain up if you keep the nolisten=true
on node B
, but never start Node C
?
I just upgraded from v0.16.4-beta.rc1 to lnd-v0.18.2. I believe that it worked fine on v0.16.4-beta.rc1
Since LND 18 we do enforce pong messages and will disconnect the peer if the don't get a reply in 30sec. Something seems not right with the connection.
Can you set the PEER subsystem to trace and provide the logs.
I just upgraded both nodes to lnd-v0.18.3, and I get the same result.
Running lncli debuglevel --level PEER=trace
seems to fill up the log so quickly it can't capture the whole event in one file, the log begins to rotate.
I don't know if it was clear from my previous writing, but after going inactive, it never reconnects until lnd is restarted. Seems to me like both peers should periodically retry connecting every few minutes after disconnection.
- Does the connection remain up if you remove:
nolisten=true
On nodeB
?
I set the following in lnd.conf
on node B:
#nolisten=true
listen=:9999
I still get the same result.
Note, node B has port 9999
blocked on the firewall, which is what I want.
2. Alternatively does the connection remain up if you keep the
nolisten=true
on nodeB
, but never start NodeC
?
I still get the same result.
Here are some more observations. For some reason node A is using IPv6 representation of the IPv4 address. I'm not sure why or if it matters. Also, the Recv-Q
seems to be large for some reason on node B, but Send-Q
is always 0 on node A. Once the connection goes inactive in lnd, the socket disappears from the ss
command's output.
A@A:~$ ss '( sport = :9735 )' dst 192.168.2.B
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp ESTAB 0 0 [::ffff:192.168.2.A]:9735 [::ffff:192.168.2.B]:57400
B@B:~$ ss '( dport = :9735 )' dst 192.168.2.A
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp ESTAB 168390 0 192.168.2.B:57400 192.168.2.A:9735
Okay, here is some new information. Node A and C have TOR installed. It seems as though node C is connecting to node A over TOR. Node B and D do not use TOR. I'd rather node A and C not use TOR to connect to each other, there is no reason to use TOR if they are on the same local network, but it seems that this bug triggered them to fall back to the advertised external address and it forgot the manually connected address.
Also, none of my nodes have externalip
as I want to keep them private.
Can someone else test? It seems the problem is the following: One node behind firewall, the other not and no externalip
defined on the machine that is not firewalled.
I think there are a few parts to the issue:
Turning off TOR on nodes A and C caused them to automatically use the local network instead.
Also, the connection between node A and B now persists with TOR off!
Notes
tor.skip-proxy-for-clearnet-targets=1
.Turning TOR on only on node A results in the connections between node A and B and node A and C to persist. Node A and C connect via the local address and not TOR.
Turning TOR on only on node C results in the connections between node A and B and node A and C to persist. Node A and C connect via the local address and not TOR.
So, it seems there is some kind of problem with running TOR on both machines. I'm guessing the tor.skip-proxy-for-clearnet-targets
option is not working right and I'm exposing a bug because I'm using not using public addresses that would be accessible over TOR. I'm wondering if tor.skip-proxy-for-clearnet-targets
is being turned off on the whole node if one peer uses TOR. Node A has no other .onion peers.
For completeness, I re-tested with TOR enabled on both A and C and the connections remained stable. A and C are also connected via the local network. Weird, but good.
I now cannot reproduce this problem. It's possible there is some slowness to the gossip of the external TOR address and as soon as node C sees the advertised tor address for node A again, the problem will return.
Will have to report back in a few days.
After 24 hours, node B's channel with node A was disconnected again.
I turned TOR off on node A and restarted LND for it to take affect. Node C was naturally disconnected by this restart, but never reconnected again. Also, node B never reconnected either.
After a while I restarted LND on node B and C. They reconnected on restart and have remained stable with TOR off.
I then restarted node A again, just to see if node B and C would automatically reconnect with TOR off on node A before they initially connected. In fact, node B and C automatically reconnected this time a minute after node A came back up.
So, it seems like the issue is very likely being triggered by TOR.
I don't want to turn TOR off on node C right now for further testing, but I think there is enough details here for someone else to try and reproduce the problem.
cc @ProofOfKeags - sounds like a matter of making ping/pong timeouts configurable and perhaps having different defaults there for Tor connections since we expect those to take longer.
I don't think that is the issue here. The machine that is struggling to stay connected is on local gigabit ethernet and should not be using TOR. Latency is about 2ms.
This isn't a timeout issue if it's on a local network.
Background
peers don't stay connected
Your environment
lnd-v0.18.2
Expected behaviour
peers should stay connected and channels remain active.
Actual behaviour
I have two nodes on a local network, Node A and Node B. Node A has port 9735 open on the firewall. Node B has no open firewall ports. Restarting node B causes it to connect to Node A, but then after a few minutes, I get the following errors and channels go inactive. I just upgraded from v0.16.4-beta.rc1 to lnd-v0.18.2. I believe that it worked fine on v0.16.4-beta.rc1 .
Node A
2024-08-28 13:42:07.420 [WRN] PEER: Peer(B): pong response failure for B@192.168.2.B:57402: timeout while waiting for pong response -- disconnecting 2024-08-28 13:42:07.420 [INF] PEER: Peer(B): disconnecting B@192.168.2.B:57402, reason: pong response failure for B@192.168.2.B:57402: timeout while waiting for pong response -- disconnecting 2024-08-28 13:42:07.420 [INF] PEER: Peer(B): unable to read message from peer: read next header: read tcp 192.168.2.A:9735->192.168.2.B:57402: use of closed network connection 2024-08-28 13:42:07.521 [INF] DISC: Removing GossipSyncer for peer=B 2024-08-28 13:42:07.521 [INF] HSWC: ChannelLink(thechannel:1): stopping 2024-08-28 13:42:07.522 [INF] HSWC: ChannelLink(thechannel:1): exited 2024-08-28 13:42:07.522 [INF] HSWC: Removing channel link with ChannelID(thechannelid)
Node B
2024-08-28 13:42:07.435 [WRN] PEER: Peer(A): pong response failure for A@192.168.2.A:9735: timeout while waiting for pong response -- disconnecting 2024-08-28 13:42:07.436 [INF] PEER: Peer(A): disconnecting A@192.168.2.A:9735, reason: pong response failure for A@192.168.2.A:9735: timeout while waiting for pong response -- disconnecting 2024-08-28 13:42:07.538 [INF] DISC: Removing GossipSyncer for peer=A 2024-08-28 13:42:07.539 [INF] HSWC: ChannelLink(thechannel:1): stopping 2024-08-28 13:42:07.540 [INF] HSWC: ChannelLink(thechannel:1): exited 2024-08-28 13:42:07.541 [INF] HSWC: Removing channel link with ChannelID(thechannelid)