lightningnetwork / lnd

Lightning Network Daemon ⚡️
MIT License
7.67k stars 2.07k forks source link

[bug]: LND unable to establish or maintain certain p2p connections #8384

Closed SachinMeier closed 9 months ago

SachinMeier commented 9 months ago

Background

We have a public node with many channels and many peers. Over the last 7-10 days, this node has seen a significantly heightened failure rate when attempting to open channels to peers with whom we have connected and opened channels in the past.

The main thing we've noticed in our LND logs is large batches of these logs consecutively, with different IP addresses.

2024-01-15 12:24:45.868 [ERR] BTCN: Can't accept connection: unable to accept connection from <IP>:<PORT>: read tcp <OUR_IP>:9735-><IP>:<PORT>: i/o timeout

In many of these cases, we have run `nc -vvvz and been able to establish TCP/IP connections to these peers, but LND still fails to open a connection or channel. To us, this indicates that it is not a network/firewall issue, but an LND issue.

This issue has not affected all of our peers, so the node works well for some peers, but is non-operational for others.

Your environment

Steps to reproduce

Unsure. Our node was working fine for quite awhile and we don't believe we changed anything around the time the issues started.

Expected behaviour

OpenChannel, Disconnect, and Connect should work as expected.

Actual behaviour

1) Connect returns an empty JSON map {} when we attempt to reconnect to a peer we are failing to open a channel to. However, the peer does not appear in our listPeers call, and any channels we have with them remain active: false. 2) Disconnect returns successfully when we call disconnect on a peer, but the peer somehow remains in the result of the listpeers call. 3) Any calls to open a channel (via RPC) fail due to timeout.

Roasbeef commented 9 months ago

Does this issue persist if you upgrade to 0.17.3? That includes some existing p2p related bug fixes.

Given that you're on 0.16.4, then it would appears that https://github.com/lightningnetwork/lnd/issues/8379 doesn't apply here.

Disconnect returns successfully when we call disconnect on a peer, but the peer somehow remains in the result of the listpeers call.

If either peer has auto reconnect active, then they'll reconnect again right away (the default for channel peers).

When you receive an openchannel call that fails due to a timeout, can you obtain a goroutine dump?

version of btcd, bitcoind, or other backend: bitcoind 22.0

Also once you update to 0.17.3 (includes the mempool scanning logic, which can help reduce force closes and make your node more secure in times likethese), you should also update to bitcoind > 24. Bitcoind 24 and beyond has an RPC call that lets you find a spend in the mempool which makes the entire process more efficient.

SachinMeier commented 9 months ago

This was a recent log from the same node. I try to open a channel to the node, and it says disconnected, but then I try to connect and it says already connected to peer

$ ./lncli openchannel 03d607f3e69fd032524a867b288216bfab263b6eaee4e07783799a6fe69bb84fac --local_amt 50000000 --sat_per_vbyte 100
[lncli] rpc error: code = Unknown desc = peer 03d607f3e69fd032524a867b288216bfab263b6eaee4e07783799a6fe69bb84fac disconnected

$ ./lncli connect 03d607f3e69fd032524a867b288216bfab263b6eaee4e07783799a6fe69bb84fac@54.155.41.207:9735
[lncli] rpc error: code = Unknown desc = already connected to peer: 03d607f3e69fd032524a867b288216bfab263b6eaee4e07783799a6fe69bb84fac@54.155.41.207:9735
SachinMeier commented 9 months ago

Thanks for the info @Roasbeef. Unfortunately, we won't be able to update our LND nodes to 17.3 just yet, but I will reply here when we have.

Roasbeef commented 9 months ago

This was a recent log from the same node. I try to open a channel to the node, and it says disconnected, but then I try to connect and it says already connected to peer

Do you see the peer rapidly connecting and disconnecting in the logs?

SachinMeier commented 9 months ago

I don't think we do. How would we determine if they were doing that frequently? Our logs mostly say:

2024-01-15 12:24:45.868 [ERR] BTCN: Can't accept connection: unable to accept connection from <IP>:<PORT>: read tcp <OUR_IP>:9735-><IP>:<PORT>: i/o timeout
SachinMeier commented 9 months ago

If either peer has auto reconnect active, then they'll reconnect again right away (the default for channel peers).

In several cases, our (broken) node lists peers as connected and their channels active even though those peers do not see a connection and see the same channels as inactive.

ziggie1984 commented 9 months ago

can you still interact with the node fluently using lncli ?

dekdynamics commented 9 months ago

The Wall can offer to troubleshoot the other side of this. Let me know what is needed.

dekdynamics commented 9 months ago

Here is my side of the connection with River Financial 2:

2024-01-17 04:00:03.754 [INF] SRVR: Established connection to: 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735
2024-01-17 04:00:03.754 [INF] SRVR: Finalizing connection to 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735, inbound=false
2024-01-17 04:00:03.799 [INF] PEER: Peer(03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5): disconnecting 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735, reason: unable to start peer: unable to read init msg: read next header: EOF
2024-01-17 04:00:03.800 [ERR] RPCS: [/lnrpc.Lightning/SendCustomMessage]: peer 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5 disconnected
2024-01-17 04:00:03.932 [INF] SRVR: Established connection to: 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735
2024-01-17 04:00:03.932 [INF] SRVR: Finalizing connection to 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735, inbound=false
2024-01-17 04:00:03.978 [INF] PEER: Peer(03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5): disconnecting 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735, reason: unable to start peer: unable to read init msg: read next header: EOF

It repeats over and over.

ziggie1984 commented 9 months ago

My gut feeling says River2 is in a deadlock ..., thats why I am asking whether they can interact with the RPC API still, moreover do you run pruned ?

TrezorHannes commented 9 months ago

This was a recent log from the same node. I try to open a channel to the node, and it says disconnected, but then I try to connect and it says already connected to peer

Do you see the peer rapidly connecting and disconnecting in the logs?

At least from my side, my LND 17.3 is trying to reconnect in rapid succession, yes. Further debug.log attached, here's the excerpt.

sudo tail -f .lnd/logs/bitcoin/mainnet/lnd.log | grep -E '03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5|afba2cf3d20d
5f0b36fd2ad56917977b9b0634416b57978999fd0c98a2b0fd42:3'
[sudo] password for admin: 
2024-01-13 19:47:05.247 [INF] SRVR: Established connection to: 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735
2024-01-13 19:47:05.247 [INF] SRVR: Finalizing connection to 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735, inbound=false
2024-01-13 19:47:05.247 [DBG] PEER: Peer(03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5): Sending Init to 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735
2024-01-13 19:47:05.378 [INF] PEER: Peer(03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5): disconnecting 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735, reason: unable to start peer: unable to read init msg: read next header: read tcp 10.8.1.2:48304->34.136.230.235:9735: read: connection reset by peer
2024-01-13 19:47:35.912 [ERR] RPCS: [/lnrpc.Lightning/DisconnectPeer]: unable to disconnect peer: peer 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5 is not connected
2024-01-13 19:47:44.009 [INF] SRVR: Established connection to: 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735
2024-01-13 19:47:44.009 [INF] SRVR: Finalizing connection to 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735, inbound=false
2024-01-13 19:47:44.009 [DBG] PEER: Peer(03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5): Sending Init to 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735
2024-01-13 19:47:44.136 [INF] PEER: Peer(03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5): disconnecting 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735, reason: unable to start peer: unable to read init msg: read next header: read tcp 10.8.1.2:50338->34.136.230.235:9735: read: connection reset by peer
2024-01-13 19:50:06.439 [INF] SRVR: Established connection to: 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735
2024-01-13 19:50:06.439 [INF] SRVR: Finalizing connection to 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735, inbound=false
2024-01-13 19:50:06.439 [DBG] PEER: Peer(03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5): Sending Init to 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735
2024-01-13 19:50:06.569 [INF] PEER: Peer(03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5): disconnecting 03aab7e9327716ee946b8fbfae039b0db85356549e72c5cca113ea67893d0821e5@34.136.230.235:9735, reason: unable to start peer: unable to read init msg: read next header: read tcp 10.8.1.2:34030->34.136.230.235:9735: read: connection reset by peer

Hope this helps 👍

2024-01-12_River-Node-2-disconnects.log

2Fast2BCn commented 9 months ago

Maybe verify soft and hard limits? To make sure you aren't hitting resource limitations?

SachinMeier commented 9 months ago

This issue was indeed resolved by upgrading to 16.4. Thank you for the help all.

At all times lncli did "work", although the commands connect and disconnect returned success but did not have any effect.

saubyk commented 9 months ago

This issue was indeed resolved by upgrading to 16.4. Thank you for the help all.

Hi @SachinMeier do you mean upgrading to 17.3?

saubyk commented 9 months ago

Closing this issue. Feel free to reopen, if the problem reoccurs with the latest version of LND. Thanks