[bug]: pong response failure

AndySchroder commented 2 months ago

Background

peers don't stay connected

Your environment

lnd-v0.18.2

Expected behaviour

peers should stay connected and channels remain active.

Actual behaviour

I have two nodes on a local network, Node A and Node B. Node A has port 9735 open on the firewall. Node B has no open firewall ports. Restarting node B causes it to connect to Node A, but then after a few minutes, I get the following errors and channels go inactive. I just upgraded from v0.16.4-beta.rc1 to lnd-v0.18.2. I believe that it worked fine on v0.16.4-beta.rc1 .

Node A

2024-08-28 13:42:07.420 [WRN] PEER: Peer(B): pong response failure for B@192.168.2.B:57402: timeout while waiting for pong response -- disconnecting 2024-08-28 13:42:07.420 [INF] PEER: Peer(B): disconnecting B@192.168.2.B:57402, reason: pong response failure for B@192.168.2.B:57402: timeout while waiting for pong response -- disconnecting 2024-08-28 13:42:07.420 [INF] PEER: Peer(B): unable to read message from peer: read next header: read tcp 192.168.2.A:9735->192.168.2.B:57402: use of closed network connection 2024-08-28 13:42:07.521 [INF] DISC: Removing GossipSyncer for peer=B 2024-08-28 13:42:07.521 [INF] HSWC: ChannelLink(thechannel:1): stopping 2024-08-28 13:42:07.522 [INF] HSWC: ChannelLink(thechannel:1): exited 2024-08-28 13:42:07.522 [INF] HSWC: Removing channel link with ChannelID(thechannelid)

Node B

2024-08-28 13:42:07.435 [WRN] PEER: Peer(A): pong response failure for A@192.168.2.A:9735: timeout while waiting for pong response -- disconnecting 2024-08-28 13:42:07.436 [INF] PEER: Peer(A): disconnecting A@192.168.2.A:9735, reason: pong response failure for A@192.168.2.A:9735: timeout while waiting for pong response -- disconnecting 2024-08-28 13:42:07.538 [INF] DISC: Removing GossipSyncer for peer=A 2024-08-28 13:42:07.539 [INF] HSWC: ChannelLink(thechannel:1): stopping 2024-08-28 13:42:07.540 [INF] HSWC: ChannelLink(thechannel:1): exited 2024-08-28 13:42:07.541 [INF] HSWC: Removing channel link with ChannelID(thechannelid)

AndySchroder commented 2 months ago

Also, I have another node C. This node is on the same physical machine as node B. Node A and C can communicate together. Wondering if node A is getting confused between node B and C since they have the same IP address? Node C has port 9735 opened on the firewall.

Also, I have another node D. This is on the same physical machine as A. Node D can stay connected to node A.

Both node B and D have nolisten=true set in lnd.conf.

ViktorTigerstrom commented 2 months ago

Hi @AndySchroder,

Could you please check the following just for some initial clarifications:

Does the connection remain up if you remove: nolisten=true On node B?
Alternatively does the connection remain up if you keep the nolisten=true on node B, but never start Node C?

ziggie1984 commented 2 months ago

I just upgraded from v0.16.4-beta.rc1 to lnd-v0.18.2. I believe that it worked fine on v0.16.4-beta.rc1

Since LND 18 we do enforce pong messages and will disconnect the peer if the don't get a reply in 30sec. Something seems not right with the connection.

Can you set the PEER subsystem to trace and provide the logs.

AndySchroder commented 1 month ago

I just upgraded both nodes to lnd-v0.18.3, and I get the same result.

AndySchroder commented 1 month ago

Running lncli debuglevel --level PEER=trace seems to fill up the log so quickly it can't capture the whole event in one file, the log begins to rotate.

AndySchroder commented 1 month ago

I don't know if it was clear from my previous writing, but after going inactive, it never reconnects until lnd is restarted. Seems to me like both peers should periodically retry connecting every few minutes after disconnection.

AndySchroder commented 1 month ago

Does the connection remain up if you remove: nolisten=true On node B?

I set the following in lnd.conf on node B:

#nolisten=true
listen=:9999

I still get the same result.

Note, node B has port 9999 blocked on the firewall, which is what I want.

AndySchroder commented 1 month ago

2. Alternatively does the connection remain up if you keep the nolisten=true on node B, but never start Node C?

I still get the same result.

AndySchroder commented 1 month ago

Here are some more observations. For some reason node A is using IPv6 representation of the IPv4 address. I'm not sure why or if it matters. Also, the Recv-Q seems to be large for some reason on node B, but Send-Q is always 0 on node A. Once the connection goes inactive in lnd, the socket disappears from the ss command's output.

A@A:~$ ss '( sport = :9735 )' dst 192.168.2.B
Netid                      State                      Recv-Q                      Send-Q                                                    Local Address:Port                                                      Peer Address:Port                       Process                      
tcp                        ESTAB                      0                           0                                                 [::ffff:192.168.2.A]:9735                                             [::ffff:192.168.2.B]:57400                                                   

B@B:~$ ss '( dport = :9735 )' dst 192.168.2.A
Netid                        State                        Recv-Q                        Send-Q                                               Local Address:Port                                                  Peer Address:Port                        Process                        
tcp                          ESTAB                        168390                        0                                                     192.168.2.B:57400                                                 192.168.2.A:9735

AndySchroder commented 1 month ago

Okay, here is some new information. Node A and C have TOR installed. It seems as though node C is connecting to node A over TOR. Node B and D do not use TOR. I'd rather node A and C not use TOR to connect to each other, there is no reason to use TOR if they are on the same local network, but it seems that this bug triggered them to fall back to the advertised external address and it forgot the manually connected address.

Also, none of my nodes have externalip as I want to keep them private.

Can someone else test? It seems the problem is the following: One node behind firewall, the other not and no externalip defined on the machine that is not firewalled.

I think there are a few parts to the issue:

Disconnecting for unknown reason
Forgetting (until restart) the successful non tor, non public address connected to, so it won't reconnect.
Preferring a TOR address over a manual peer connection with a local IP address.

AndySchroder commented 1 month ago

Turning off TOR on nodes A and C caused them to automatically use the local network instead.

Also, the connection between node A and B now persists with TOR off!

Notes

Before turning TOR off, I was using tor.skip-proxy-for-clearnet-targets=1.
Node A has not had TOR installed for very long. It's possible I enabled it at the same time as I upgraded to lnd-v0.18.2. Node C has had TOR running for years.

AndySchroder commented 1 month ago

Turning TOR on only on node A results in the connections between node A and B and node A and C to persist. Node A and C connect via the local address and not TOR.

AndySchroder commented 1 month ago

Turning TOR on only on node C results in the connections between node A and B and node A and C to persist. Node A and C connect via the local address and not TOR.

So, it seems there is some kind of problem with running TOR on both machines. I'm guessing the tor.skip-proxy-for-clearnet-targets option is not working right and I'm exposing a bug because I'm using not using public addresses that would be accessible over TOR. I'm wondering if tor.skip-proxy-for-clearnet-targets is being turned off on the whole node if one peer uses TOR. Node A has no other .onion peers.

AndySchroder commented 1 month ago

For completeness, I re-tested with TOR enabled on both A and C and the connections remained stable. A and C are also connected via the local network. Weird, but good.

I now cannot reproduce this problem. It's possible there is some slowness to the gossip of the external TOR address and as soon as node C sees the advertised tor address for node A again, the problem will return.

Will have to report back in a few days.

AndySchroder commented 1 month ago

After 24 hours, node B's channel with node A was disconnected again.

I turned TOR off on node A and restarted LND for it to take affect. Node C was naturally disconnected by this restart, but never reconnected again. Also, node B never reconnected either.

After a while I restarted LND on node B and C. They reconnected on restart and have remained stable with TOR off.

I then restarted node A again, just to see if node B and C would automatically reconnect with TOR off on node A before they initially connected. In fact, node B and C automatically reconnected this time a minute after node A came back up.

So, it seems like the issue is very likely being triggered by TOR.

I don't want to turn TOR off on node C right now for further testing, but I think there is enough details here for someone else to try and reproduce the problem.

ellemouton commented 1 month ago

cc @ProofOfKeags - sounds like a matter of making ping/pong timeouts configurable and perhaps having different defaults there for Tor connections since we expect those to take longer.

AndySchroder commented 1 month ago

I don't think that is the issue here. The machine that is struggling to stay connected is on local gigabit ethernet and should not be using TOR. Latency is about 2ms.

ProofOfKeags commented 1 month ago

This isn't a timeout issue if it's on a local network.

lightningnetwork / lnd