lightningnetwork / lnd

Lightning Network Daemon ⚡️

[bug]: LND can't connect to peers/channels, sync to chain is lost #7917

Closed. Reckless-Satoshi closed this issue 1 year ago.

Reckless-Satoshi commented 1 year ago

Background

On the RoboSats mainnet node, on start-up, LND reports both synced to chain and synced to graph and connects to a few peers/channels. It then stalls partway through connecting to all peers and channels, reaching only 2-30 of the 140 total. Eventually the number of active channels decays to 0, effectively leaving the node dead. After some time, synced_to_chain goes false.

This has been the systematic behaviour every time LND has been restarted over the last 12 hours.

As RoboSats uses mainly hodl invoices, in a few hours many force closures will be triggered and recovery will be painful. Hopefully we can get the node back online first :D

The only interesting thing I see in the logs is a bunch of these (presumably when LND loses chain sync?), but I can grep for more if you think I am missing something.

2023-08-24 05:15:28.633 [ERR] BTCN: Can't accept connection: unable to accept connection from 127.0.0.1:45314: write tcp 127.0.0.1:9735->127.0.0.1:45314: write: broken pipe
2023-08-24 05:15:28.959 [ERR] BTCN: Can't accept connection: unable to accept connection from 127.0.0.1:45324: write tcp 127.0.0.1:9735->127.0.0.1:45324: write: broken pipe
2023-08-24 05:15:29.169 [ERR] BTCN: Can't accept connection: unable to accept connection from 127.0.0.1:45336: write tcp 127.0.0.1:9735->127.0.0.1:45336: write: broken pipe

There is one of these every few milliseconds.

I already fully rebooted the machine to no avail, and also updated docker/docker-compose to the latest stable versions.

Your environment

Steps to reproduce

I will probably be unable to reproduce whatever happened to arrive at this state.

Expected behaviour

LND should start and connect to all peers/channels.

Reckless-Satoshi commented 1 year ago

Some more context, maybe useful for debugging: the RoboSats node is Tor-only, and the Tor daemon runs as another container in the orchestration. Possibly those 127.0.0.1:***** broken pipes come from it. However, all containers can talk to each other successfully, and so can the LND container: after every LND container kill and restart it does connect to some peers and gets synced to chain and graph. After some minutes, the broken pipes start showing up.

gRPC requests from other containers work as usual. For example, one can create a new order in Robo and the invoice is created.

There are no RAM or disk limitations.

On the same machine, with identical orchestration, there is a full testnet replica. It is working as usual.

hieblmi commented 1 year ago

Hi, thanks for reporting the issue. Did this behavior occur the first time after you adjusted settings or did it start without intervention? Would you be able to provide lnd logs that contain the start-up as well as show the decay of active peers? You can also mail the logs if you don't want to attach them here.

Reckless-Satoshi commented 1 year ago

Did this behavior occur the first time after you adjusted settings or did it start without intervention?

No intervention. No one had logged into the machine for a day prior to the issue, and no setting had been changed for at least 2 months.

Would you be able to provide lnd logs that contain the start-up as well as show the decay of active peers? You can also mail the logs if you don't want to attach them here.

Collecting the logs now. It would be great if I could send them by email; where can I send them?

hieblmi commented 1 year ago

hieblmi at lightning dot engineering

Reckless-Satoshi commented 1 year ago

Sent logs of the latest restart. It connected to 12+ channels; currently only 7 show as active. synced_to_chain is false now.

hieblmi commented 1 year ago

Is the wallet ever unlocked? I can't see that from the logs.

Reckless-Satoshi commented 1 year ago

Is the wallet ever unlocked? I can't see that from the logs.

Yes, it is. Is the log file ~1.68 MB?

Edit: I believe the log level is set to WARN; will rerun with DEBUG.

Reckless-Satoshi commented 1 year ago

Sent new logs with debug verbosity. This time it seems something failed before any peer was connected.

Edit: sent another run where it connected to 2 channels before chain sync went false.

Roasbeef commented 1 year ago

Can other services connect to the Tor daemon? If you restart the Tor daemon, is the issue resolved? Looks like it isn't able to maintain SOCKS proxy connections.

If you check the logs of the tor daemon, do you see any relevant error logging?

If you used a normal CLI command to establish a TCP connection using the SOCKS proxy, does that work?

Reckless-Satoshi commented 1 year ago

Hey thank you @Roasbeef

Can other services connect to the Tor daemon? If you restart the Tor daemon, is the issue resolved? Looks like it isn't able to maintain SOCKS proxy connections.

Every other service can keep its connection to the Tor daemon. Restarting the Tor daemon, or the whole machine, does not help. But restarting only the LND container does grant at least those initial 5-10 minutes without connection issues, during which it syncs the graph and establishes connections to a few peers.

If you check the logs of the tor daemon, do you see any relevant error logging?

It's hard to say, as I am not accustomed to looking at Tor daemon logs. There is certainly a huge amount of logging, some of which looks like timeouts.

If you used a normal CLI command to establish a TCP connection using the SOCKS proxy, does that work?

From inside the LND container I ran curl --socks5-hostname 127.0.0.1:9050 https://example.com and it prints the example.com content correctly. It does so even when LND is already incapable of connecting to peers.

I also tried deleting ./lnd/v3_onion_private_key. At the next start-up it created a new one, but after some minutes it again failed to connect to all peers. So deleting v3_onion_private_key did not help.

Roasbeef commented 1 year ago

What does your Tor related config for lnd look like?

Roasbeef commented 1 year ago

write: broken pipe

In the Go runtime, this means that the other side has torn down or reset the TCP connection, but a program/goroutine then tries to write to it.
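As a minimal, self-contained Go sketch of that failure mode (illustration only, not lnd code): the peer closes its side of the TCP connection, and a later Write on our side eventually fails with an error like "write: broken pipe" (or "connection reset by peer", depending on timing).

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    ln, err := net.Listen("tcp", "127.0.0.1:0")
    if err != nil {
        panic(err)
    }
    // The "peer": accept one connection and immediately tear it down.
    go func() {
        if c, err := ln.Accept(); err == nil {
            c.Close()
        }
    }()

    conn, err := net.Dial("tcp", ln.Addr().String())
    if err != nil {
        panic(err)
    }

    // Keep writing; once the kernel has processed the peer's close, a
    // write returns an error such as "write: broken pipe".
    for i := 0; i < 10; i++ {
        if _, err := conn.Write([]byte("ping\n")); err != nil {
            fmt.Println("write failed:", err)
            return
        }
        time.Sleep(200 * time.Millisecond)
    }
}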

Roasbeef commented 1 year ago

Also are you able to manually use lncli connect to create outbound connections, either to other tor nodes or clearnet nodes?

Has this been going on for some time? Or did some recent update or config change trigger the behavior?

Reckless-Satoshi commented 1 year ago

Good news and bad news. Good news: the node managed to connect to all channels. Bad news: it was not thanks to anything I tried; it simply happened overnight and I do not know why it works now. I have restarted the node 3 times, and every single time it connects to all peers.

However, I notice that after ~37 peers it pauses and there are many broken pipes; after 1-2 minutes it resumes and connects to all peers successfully. I can send logs of these runs if needed.

A wild guess: does LND connect to all peers at once, asynchronously and in no specific order? Or is there a deterministic order or batches (e.g. 10 peers at a time), where it keeps retrying if the full batch is unreachable?

What does your Tor related config for lnd look like?

This is all there is in lnd.conf relating to Tor:

tor.active=1
tor.v3=1

Also are you able to manually use lncli connect to create outbound connections, either to other tor nodes or clearnet nodes?

Unfortunately I cannot test this now, given that the node is back and running as usual. Running lncli connect does work now, but I do not know whether it would have worked yesterday.

Has this been going on for some time? Or did some recent update or config change trigger the behavior?

This is the first time I have observed this behavior. No recent change of settings or updates; in fact, I happened to be on holiday when it started to fail :`)

What I observe now is a few outgoing in_flight payments that were stuck when the issue started and are still in_flight, yet the first-hop channels these are flying on show 0 total HTLCs (outgoing/incoming) in node managers like Thunderhub. I must say I have observed this particular thing happening 2 times in the last 8 months: payments that get stuck in_flight forever, even weeks after the first-hop channel was collaboratively closed. This is probably better suited for another issue; however, there are quite a few of the same stuck payments at the moment.

Reckless-Satoshi commented 1 year ago

It happened again and I have some more information on this case.

There is some extra context that might be of interest: the RoboSats LND node has previously had other issues that I originally thought were unrelated to this one. For example, it sometimes cannot create new invoices (gRPC addinvoice or lncli addinvoice would never return). The first time this happened was ~6-8 months ago, yet a simple restart of the LND container fixed it instantly. This issue started happening a bit more often lately, maybe every 2-5 days. Sometimes it even happens a few times in a day, but then it might go 2-3 weeks without any occurrence.

I did not observe an obvious relationship between the number of pending HTLCs or invoices and the likelihood of LND getting stuck (but there might be one!?). I did not report this earlier as I never saw anything out of the ordinary in the logs (now I realize the verbosity was set to WARN); it simply stops replying to some RPC calls (e.g. addinvoice, newaddress, settleinvoice), while it keeps replying to lncli getinfo, lncli lookupinvoice or listinvoices.

Again, the fact that simply restarting the LND container seems to temporarily fix both the broken pipes and the unresponsive gRPC calls leads me to believe there is something LND itself is struggling with.

LND typically uses around 4 GB of RAM. The machine is not disk, RAM or CPU limited.

Edit: I can confirm my issue looks very similar to #7924, including some local lncli calls getting stuck.

Reckless-Satoshi commented 1 year ago

I can confirm that even if it connects to all peers/channels, lncli addinvoice can still never return. So these 2 issues can happen independently or jointly.

Roasbeef commented 1 year ago

A wild guess: does LND connect to all peers at once, asynchronously and in no specific order? Or is there a deterministic order or batches (e.g. 10 peers at a time), where it keeps retrying if the full batch is unreachable?

The order is random-ish, likely derived from node pubkey ordering.

However, I notice that after ~37 peers it pauses and there are many broken pipes; after 1-2 minutes it resumes and connects to all peers successfully. I can send logs of these runs if needed.

Sounds like it maybe runs into some Tor fd-like limit? I have a Tor node myself with ~50 channels/peers, but haven't been able to repro anything like this. I wonder if it's some Tor thing where it starts to shed connections that are lower in its QoS priority ranking.

Roasbeef commented 1 year ago

lncli addinvoice can still never return. So these 2 issues can happen independently or jointly.

If this happens again, are you able to grab a profile? What we want here is a snapshot of all the active goroutines, then we get all the stack traces and can see what is actually blocking.

LND typically uses around 4 GB of RAM. The machine is not disk, RAM or CPU limited.

That sounds rather high... particularly for the 30-ish peer range. Alongside the goroutine trace, can you also get a memory profile? This won't reveal any sensitive information about the node (just what is allocating all that memory).

Reckless-Satoshi commented 1 year ago

Sounds like it maybe runs into some Tor fd-like limit? I have a Tor node myself with ~50 channels/peers, but haven't been able to repro anything like this. I wonder if it's some Tor thing where it starts to shed connections that are lower in its QoS priority ranking.

The node ran successfully for over 1 year with 100+ channels/peers. Currently half of all channels have been closed; it is now down to ~70 channels, with no improvement with respect to this issue.

If this happens again, are you able to grab a profile? What we want here is a snapshot of all the active goroutines, then we get all the stack traces and can see what is actually blocking.

It happens very frequently. To keep the node up I am restarting it 15-30 times a day. After each restart there is a chance of some stable time before lncli addinvoice stops working. However, the fact that it might not connect to all peers on restart makes managing this node a real challenge.

Profile taken after ~30 minutes of running. It connected to all peers, but eventually many lncli commands were not returning: https://pastebin.com/yeeBmHpB

Memory profile: https://pastebin.com/ReSjfepE

At the moment there might be ~10-14 outgoing payments that are in_flight but whose timelock block height is already in the past. How can we deal with these? Are they safe to ignore and retry?

In my view the issues (unresponsive lncli commands, no connection to peers, and stuck in_flight payments) are likely to have a common cause. Given how some of these issues have shown up before but have only grown more severe over time, it looks like some sort of "node aging". What is recommended here from an operational standpoint? Should we simply create a new node and ditch the current one altogether, or is there any chance this issue can be solved?

Roasbeef commented 1 year ago

@Reckless-Satoshi can you move that to a paste bin so it's easier to scroll down thru the comments here? Thx!

Roasbeef commented 1 year ago

Some sections that pop out:

14 @ 0x43a236 0x44b5af 0x44b586 0x468ee6 0x14bb979 0x14bb955 0x1618005 0x171ddc4 0x171943e 0x171658d 0x46d261
#   0x468ee5    sync.runtime_SemacquireRWMutexR+0x25                                    runtime/sema.go:82
#   0x14bb978   sync.(*RWMutex).RLock+0xf8                                      sync/rwmutex.go:71
#   0x14bb954   github.com/lightningnetwork/lnd/channeldb.(*ChannelGraph).HasChannelEdge+0xd4               github.com/lightningnetwork/lnd/channeldb/graph.go:1128
#   0x1618004   github.com/lightningnetwork/lnd/routing.(*ChannelRouter).IsStaleEdgePolicy+0x64             github.com/lightningnetwork/lnd/routing/router.go:2658
#   0x171ddc3   github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).handleChanUpdate+0x423       github.com/lightningnetwork/lnd/discovery/gossiper.go:2702
#   0x171943d   github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).processNetworkAnnouncement+0xfd  github.com/lightningnetwork/lnd/discovery/gossiper.go:2007
#   0x171658c   github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).handleNetworkMessages+0x22c      github.com/lightningnetwork/lnd/discovery/gossiper.go:1503
4 @ 0x43a236 0x44a4be 0x14825ea 0x14858df 0x1880358 0x1889fcc 0x1888718 0x1057967 0x105491a 0x185d9d1 0x105491a 0x185d04b 0x105491a 0x185d60a 0x105491a 0x185c858 0x105491a 0x10547be 0x18885d8 0xcaf84f 0xcb3f36 0xcad3b8 0x46d261
#   0x14825e9   github.com/lightningnetwork/lnd/invoices.(*InvoiceExpiryWatcher).AddInvoices+0x89               github.com/lightningnetwork/lnd/invoices/invoice_expiry_watcher.go:265
#   0x14858de   github.com/lightningnetwork/lnd/invoices.(*InvoiceRegistry).AddInvoice+0x31e                    github.com/lightningnetwork/lnd/invoices/invoiceregistry.go:605
#   0x1880357   github.com/lightningnetwork/lnd/lnrpc/invoicesrpc.AddInvoice+0x1477                     github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/addinvoice.go:474
#   0x1889fcb   github.com/lightningnetwork/lnd/lnrpc/invoicesrpc.(*Server).AddHoldInvoice+0x36b                github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/invoices_server.go:367
#   0x1888717   github.com/lightningnetwork/lnd/lnrpc/invoicesrpc._Invoices_AddHoldInvoice_Handler.func1+0x77           github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/invoices_grpc.pb.go:225
#   0x1057966   github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1+0x86         github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107
#   0x1054919   github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39                    github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25
#   0x185d9d0   github.com/lightningnetwork/lnd/rpcperms.(*InterceptorChain).middlewareUnaryServerInterceptor.func1+0x110   github.com/lightningnetwork/lnd/rpcperms/interceptor.go:832
#   0x1054919   github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39                    github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25
#   0x185d04a   github.com/lightningnetwork/lnd/rpcperms.(*InterceptorChain).MacaroonUnaryServerInterceptor.func1+0x8a      github.com/lightningnetwork/lnd/rpcperms/interceptor.go:689
#   0x1054919   github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39                    github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25
#   0x185d609   github.com/lightningnetwork/lnd/rpcperms.(*InterceptorChain).rpcStateUnaryServerInterceptor.func1+0x109     github.com/lightningnetwork/lnd/rpcperms/interceptor.go:781
#   0x1054919   github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39                    github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25
#   0x185c857   github.com/lightningnetwork/lnd/rpcperms.errorLogUnaryServerInterceptor.func1+0x57              github.com/lightningnetwork/lnd/rpcperms/interceptor.go:605
#   0x1054919   github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39                    github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25
#   0x10547bd   github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1+0xbd                    github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34
#   0x18885d7   github.com/lightningnetwork/lnd/lnrpc/invoicesrpc._Invoices_AddHoldInvoice_Handler+0x137            github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/invoices_grpc.pb.go:227
#   0xcaf84e    google.golang.org/grpc.(*Server).processUnaryRPC+0xcee                              google.golang.org/grpc@v1.41.0/server.go:1279
#   0xcb3f35    google.golang.org/grpc.(*Server).handleStream+0xa35                             google.golang.org/grpc@v1.41.0/server.go:1608
#   0xcad3b7    google.golang.org/grpc.(*Server).serveStreams.func1.2+0x97  

Looks like something is trying to grab a lock in the DB, but that then blocks. Thanks, this was super helpful; we'll look into it.
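As a generic illustration of the pattern in those traces (a sketch, not lnd code; the names are made up): once a writer holds a sync.RWMutex and does not return, every later RLock call, like the HasChannelEdge lookups above, queues up behind it.

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var graphMu sync.RWMutex

    // A "writer" that grabs the lock and never finishes, e.g. because it is
    // itself blocked on a DB transaction further down its own stack.
    go func() {
        graphMu.Lock()
        select {} // never unlocks
    }()

    time.Sleep(100 * time.Millisecond)

    // A "reader" such as HasChannelEdge: RLock blocks behind the writer.
    done := make(chan struct{})
    go func() {
        graphMu.RLock()
        graphMu.RUnlock()
        close(done)
    }()

    select {
    case <-done:
        fmt.Println("read lock acquired")
    case <-time.After(time.Second):
        fmt.Println("reader stuck behind the writer holding the RWMutex")
    }
}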

Reckless-Satoshi commented 1 year ago

can you move that to a paste bin

Done, sorry about that! Thank you guys for the help!

Roasbeef commented 1 year ago

Those stack traces were super helpful! We have a candidate fix here: https://github.com/lightningnetwork/lnd/issues/7928

Reckless-Satoshi commented 1 year ago

Those stack traces were super helpful! We have a candidate fix here: #7928

The PR seems to patch a bug introduced after the v0.16.4 release. We are running into this issue on v0.16.4, but it was also observed in v0.16.3 and prior.

Does it make sense to apply #7938 over the v0.16.4 tag and test?

Roasbeef commented 1 year ago

@Reckless-Satoshi yes, if you're able to do so. We also plan to cut an rc2, as several other users reported that the issue was solved by the PR.

Reckless-Satoshi commented 1 year ago

@Reckless-Satoshi yes, if you're able to do so. We also plan to cut an rc2, as several other users reported that the issue was solved by the PR.

Running rc2. Can we keep this issue open until the issue is confirmed solved? I will report all findings later, but at a quick glance commands like lncli addinvoice take very long to reply (30+ seconds), so I suspect they might still get fully stuck after some time.

Roasbeef commented 1 year ago

but at a quick glance commands like lncli addinvoice take very long to reply (30+ seconds), so I suspect they might still get fully stuck after some time.

Can you get a CPU profile of this?

Reckless-Satoshi commented 1 year ago

Reporting some findings here after 24h running v0.17.0-rc2. The two main issues discussed in this thread persist: (1) connection to all peers is randomly lost or never achieved at all at start-up, and (2) unresponsive RPC for some calls (addinvoice, settleinvoice, etc.).

Here (Pastebin) is a profile taken after 12 hours of running perfectly (but sluggishly), after which RPC became unresponsive. I sent logs to @hieblmi as well.

I also sent logs of a node start-up where the node never managed to connect to all peers but got stuck at ~12 with broken pipe errors, similar to the ones I sent when opening this thread. I have observed that it can still eventually connect to all peers several hours later. If this is a Tor issue, I am unsure why it is surfacing now, as nothing has changed in the configuration.

Can you get a CPU profile of this?

I am not familiar with Go, how can I take a CPU profile of LND?

The node still has a lot of in_flight outgoing payments that have existed for a long time. How should we deal with these? Is there any way to close them? Are they safe to ignore and re-attempt payment? Do they consume resources?

guggero commented 1 year ago

I am not familiar with Go, how can I take a CPU profile of LND?

You do it the same way you get a goroutine dump, but you use /debug/pprof/profile?seconds=30 (which will capture a profile over 30 seconds).

Reckless-Satoshi commented 1 year ago

You do it the same way you get a goroutine dump, but you use /debug/pprof/profile?seconds=30 (which will capture a profile over 30 seconds).

Here is the CPU profile while addinvoice is unresponsive: https://pastebin.com/0NznDKYE

Another piece of data that might be of interest: we have added another LND docker service to the same docker-compose orchestration. It is a perfect clone: same lnd.conf, same Tor daemon, same bitcoind, etc. This node does not seem to have the same problems: it starts up quickly and with ease, it can connect to all peers (no broken pipes, at the same time as the other instance can't connect to peers), and it seems to generate invoices in a timely manner. I will report if this node starts to show some of the same symptoms.

Roasbeef commented 1 year ago

@Reckless-Satoshi can you get that goroutine profile again once the node has been running for longer? In that profile, it looks like you only have 5 or so peers.

In terms of other configuration changes, you can also look into:

  • Increasing the bitcoind.txpollinginterval value; you can raise it to something like 5m.

  • Increasing the number of RPC threads that bitcoind uses.

Reckless-Satoshi commented 1 year ago

The issue (a hung lncli addinvoice) occurred twice again today on the new node, after two days of working well since we replaced it. Here is the CPU profile after it happened while all channels were active: https://pastebin.com/a4KuKXd9

On this new node, in addition to the fixes proposed by @Roasbeef in the last comment, we disabled rpc-middleware and stopped running LIT and Loop to rule out any cause related to these tools. We also increased ulimit -n considerably on the machine and in the LND container. The issue still happens.

Reckless-Satoshi commented 1 year ago

Active goroutines after ~2h of running, with all peers connected and gRPC functional (but very sluggish): https://pastebin.com/hV6QKSLK

guggero commented 1 year ago

Thanks for the CPU profile and goroutine dump. Could you please take another CPU profile that covers a longer time span? Perhaps 5 minutes? Just change seconds=30 to seconds=600.

Also, a couple of additional questions:

  • What is the maxmempool setting of the bitcoind node?

  • What is the numgraphsyncpeers setting of the lnd nodes?

  • What is the sync-freelist and db.bolt.nofreelistsync setting of the lnd nodes?

  • Do you see any errors/warnings in the bitcoind log mentioning http work queue depth exceeded?

Thanks a lot!

Reckless-Satoshi commented 1 year ago
  • What is the maxmempool setting of the bitcoind node?
    maxmempool=6000
    mempoolexpiry=700

    The reason for the large mempool was to avoid cases where outgoing on-chain TXs are forgotten. I see how a maxmempool that big might have an impact. There might also be a correlation between LND issues and the mempool getting larger. So after your message, we changed the values to:

    maxmempool=768 
    mempoolexpiry=400

    With these values, RPC still becomes unresponsive (but it sort of seems more stable? hard to say; sometimes it lasts for a couple of days before getting stuck). If we turn on the old LND node (same orchestration), it is guaranteed that the new node's RPC will become unresponsive and its connections to peers will drop. It is in fact now impossible to bring up the older node; it seems to never finish connecting to peers, and it crashes the newer LND node.

  • What is the numgraphsyncpeers setting of the lnd nodes?

9 (default)

  • What is the sync-freelist and db.bolt.nofreelistsync setting of the lnd nodes?

true (default)

  • Do you see any errors/warnings in the bitcoind log mentioning http work queue depth exceeded?

No hits for "queue" other than [http] creating work queue of depth 16

I have also inspected the disk (errors, latency, read/write speed) and everything seems to be alright. We have increased min_htlc to 1000 sats to avoid small held payment parts. I've observed ~200 pending HTLCs when RPC becomes unresponsive, but it also becomes unresponsive with a smaller number. Not sure if there is any correlation.

Perhaps 5 minutes? Just change seconds=30 to seconds=600

5-minute CPU profile on the new node, taken once RPC was unresponsive after ~3 days of running: https://pastebin.com/YxVSeRDW
10-minute CPU profile on the new node, taken once RPC was unresponsive a few minutes after restarting LND: https://pastebin.com/VJ6SkmXk

On the new node I have also observed lncli getinfo become unresponsive twice. Restarting always resolves the unresponsive RPC, yet that means the node spends a large part of the day restarting.

Roasbeef commented 1 year ago

Can you generate the SVG version of those CPU profiles? It's much easier to interpret that way.

Looking at the prior goroutine dump, it looks like the goroutines are blocked on acquiring locks in the ChannelGraph, specifically in the methods HasChannelEdge and UpdateEdgePolicy. This leads me to believe there's an issue in batch/scheduler.go.

Just to confirm, @Reckless-Satoshi you're not running any additional patches on top?

Digging a bit deeper, it looks like a few goroutines at the very bottom are actually blocked on opening a DB transaction:

1 @ 0x43a236 0x44b5af 0x44b586 0x468e86 0x489b65 0xa3a9f7 0xa3a9e2 0xa3a67d 0xa4d2bd 0xa4e074 0xa4e065 0x14a9a6b 0x14a99a1 0x489f02 0x14aa772 0x14aa742 0x46d261
#   0x468e85    sync.runtime_SemacquireMutex+0x25                   runtime/sema.go:77
#   0x489b64    sync.(*Mutex).lockSlow+0x164                        sync/mutex.go:171
#   0xa3a9f6    sync.(*Mutex).Lock+0x76                         sync/mutex.go:90
#   0xa3a9e1    go.etcd.io/bbolt.(*DB).beginRWTx+0x61                   go.etcd.io/bbolt@v1.3.7/db.go:755
#   0xa3a67c    go.etcd.io/bbolt.(*DB).Begin+0x1c                   go.etcd.io/bbolt@v1.3.7/db.go:697
#   0xa4d2bc    github.com/btcsuite/btcwallet/walletdb/bdb.(*db).beginTx+0x1c       github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:323
#   0xa4e073    github.com/btcsuite/btcwallet/walletdb/bdb.(*db).BeginReadWriteTx+0x53  github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:335
#   0xa4e064    github.com/btcsuite/btcwallet/walletdb/bdb.(*db).Update+0x44        github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:417
#   0x14a9a6a   github.com/lightningnetwork/lnd/kvdb.Update+0x1ea           github.com/lightningnetwork/lnd/kvdb@v1.4.2/interface.go:16
#   0x14a99a0   github.com/lightningnetwork/lnd/batch.(*batch).run+0x120        github.com/lightningnetwork/lnd/batch/batch.go:54
#   0x489f01    sync.(*Once).doSlow+0xc1                        sync/once.go:74
#   0x14aa771   sync.(*Once).Do+0x51                            sync/once.go:65
#   0x14aa741   github.com/lightningnetwork/lnd/batch.(*batch).trigger+0x21     github.com/lightningnetwork/lnd/batch/batch.go:32

1 @ 0x43a236 0x44b5af 0x44b586 0x468e86 0x489b65 0xa3a9f7 0xa3a9e2 0xa3a67d 0xa4d2bd 0xa4e074 0xa4e065 0x14e715f 0x14e70ce 0x156c436 0x1611c53 0x1611a65 0x1610d29 0x160cb76 0x46d261
#   0x468e85    sync.runtime_SemacquireMutex+0x25                           runtime/sema.go:77
#   0x489b64    sync.(*Mutex).lockSlow+0x164                                sync/mutex.go:171
#   0xa3a9f6    sync.(*Mutex).Lock+0x76                                 sync/mutex.go:90
#   0xa3a9e1    go.etcd.io/bbolt.(*DB).beginRWTx+0x61                           go.etcd.io/bbolt@v1.3.7/db.go:755
#   0xa3a67c    go.etcd.io/bbolt.(*DB).Begin+0x1c                           go.etcd.io/bbolt@v1.3.7/db.go:697
#   0xa4d2bc    github.com/btcsuite/btcwallet/walletdb/bdb.(*db).beginTx+0x1c               github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:323
#   0xa4e073    github.com/btcsuite/btcwallet/walletdb/bdb.(*db).BeginReadWriteTx+0x53          github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:335
#   0xa4e064    github.com/btcsuite/btcwallet/walletdb/bdb.(*db).Update+0x44                github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:417
#   0x14e715e   github.com/lightningnetwork/lnd/kvdb.Update+0x13e                   github.com/lightningnetwork/lnd/kvdb@v1.4.2/interface.go:16
#   0x14e70cd   github.com/lightningnetwork/lnd/channeldb.(*OpenChannel).AppendRemoteCommitChain+0xad   github.com/lightningnetwork/lnd/channeldb/channel.go:2553
#   0x156c435   github.com/lightningnetwork/lnd/lnwallet.(*LightningChannel).SignNextCommitment+0x1055  github.com/lightningnetwork/lnd/lnwallet/channel.go:4157
#   0x1611c52   github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).updateCommitTx+0x72       github.com/lightningnetwork/lnd/htlcswitch/link.go:2253
#   0x1611a64   github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).updateCommitTxOrFail+0x24 github.com/lightningnetwork/lnd/htlcswitch/link.go:2204
#   0x1610d28   github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).handleUpstreamMsg+0x1248  github.com/lightningnetwork/lnd/htlcswitch/link.go:2094
#   0x160cb75   github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).htlcManager+0x11f5        github.com/lightningnetwork/lnd/htlcswitch/link.go:1230

This also looks somewhat suspect; here we see the goroutine blocked on an fsync call:

1 @ 0x4c55c7 0x4c2d6a 0xa334ed 0xa45f57 0xa44a9a 0xa4b1a5 0xa4e13c 0x14ea1cd 0x14ea0ca 0x160b145 0x160b0fa 0x160afd7 0x160bd85 0x46d261
#   0x4c55c6    syscall.Syscall+0x26                                    syscall/syscall_linux.go:69
#   0x4c2d69    syscall.Fdatasync+0x29                                  syscall/zsyscall_linux_amd64.go:418
#   0xa334ec    go.etcd.io/bbolt.fdatasync+0x4c                             go.etcd.io/bbolt@v1.3.7/bolt_linux.go:9
#   0xa45f56    go.etcd.io/bbolt.(*Tx).write+0x236                          go.etcd.io/bbolt@v1.3.7/tx.go:473
#   0xa44a99    go.etcd.io/bbolt.(*Tx).Commit+0x239                         go.etcd.io/bbolt@v1.3.7/tx.go:196
#   0xa4b1a4    github.com/btcsuite/btcwallet/walletdb/bdb.(*transaction).Commit+0x24           github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:101
#   0xa4e13b    github.com/btcsuite/btcwallet/walletdb/bdb.(*db).Update+0x11b               github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:437
#   0x14ea1cc   github.com/lightningnetwork/lnd/kvdb.Update+0x12c                   github.com/lightningnetwork/lnd/kvdb@v1.4.2/interface.go:16
#   0x14ea0c9   github.com/lightningnetwork/lnd/channeldb.(*OpenChannel).RemoveFwdPkgs+0x29     github.com/lightningnetwork/lnd/channeldb/channel.go:3045
#   0x160b144   github.com/lightningnetwork/lnd/lnwallet.(*LightningChannel).RemoveFwdPkgs+0x104    github.com/lightningnetwork/lnd/lnwallet/channel.go:5657
#   0x160b0f9   github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).resolveFwdPkg+0xb9        github.com/lightningnetwork/lnd/htlcswitch/link.go:826
#   0x160afd6   github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).resolveFwdPkgs+0xf6       github.com/lightningnetwork/lnd/htlcswitch/link.go:803
#   0x160bd84   github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).htlcManager+0x404     github.com/lightningnetwork/lnd/htlcswitch/link.go:1084

Roasbeef commented 1 year ago

I've been polling my nodes for goroutine dumps for a few minutes and never see an explicit block on fdatasync like that. Looking at all the goroutine dumps you've sent so far, each of them has this blocking goroutine at the very bottom of the stack trace. If the syscall blocks indefinitely for whatever reason, then just about everything else stops, as the syscall is made in the DB while the main mutex is held.
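A rough, self-contained sketch of that failure mode (illustration only, not lnd or bbolt code): the DB's write mutex is held across the commit's fsync, so if that syscall stalls on slow storage, every other transaction queues up behind the lock.

package main

import (
    "fmt"
    "os"
    "sync"
    "time"
)

// commit mimics a bbolt-style read-write transaction: the write mutex is
// held for the whole commit, including the final fsync.
func commit(mu *sync.Mutex, f *os.File) {
    mu.Lock()
    defer mu.Unlock()
    // If the disk stalls inside Sync, the mutex stays held and every other
    // committer blocks on mu.Lock().
    f.Sync()
}

func main() {
    var mu sync.Mutex

    f, err := os.CreateTemp("", "fakewal")
    if err != nil {
        panic(err)
    }
    defer os.Remove(f.Name())
    defer f.Close()

    start := time.Now()
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            commit(&mu, f) // every committer serializes on mu around the sync
        }()
    }
    wg.Wait()
    fmt.Println("8 commits serialized behind one lock in", time.Since(start))
}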

Roasbeef commented 1 year ago

@Reckless-Satoshi can you try to obtain a blocking profile? This can help us determine whether the syscall is the issue or not, or generally where things are blocked on I/O: go tool pprof http://localhost:6060/debug/pprof/block

Reckless-Satoshi commented 1 year ago

go tool pprof http://localhost:6060/debug/pprof/block

I will take this one the next time the RPC becomes unresponsive.

Here is the 10-minute CPU profile as SVG: https://user-images.githubusercontent.com/90936742/267646868-b95dea24-c627-4ce6-a9d7-804df1ba1652.svg

megaptera21 commented 1 year ago

Hi @Reckless-Satoshi Will the old RoboSats node come online once again or is it dead now? Would like to close the channel in a coop close if possible. Thanks.

Reckless-Satoshi commented 1 year ago

Will the old RoboSats node come online once again or is it dead now? Would like to close the channel in a coop close if possible.

If we manage to bring it back online we will initiate collaborative closes for every channel. We'll try again in the following days.

@Roasbeef I have tried docker exec ... curl http://localhost:6060/debug/pprof/block?seconds=600 --output /save/path/block followed by go tool pprof -top /save/path/block several times, in a similar fashion to the previous profiles I took; however, I always retrieve an empty set. I always tried while the RPC was unresponsive.

File: lnd
Type: delay
Time: Sep 17, 2023 at 6:21am (UTC)
Duration: 1mins, Total samples = 0 
Showing nodes accounting for 0, 0% of 0 total
      flat  flat%   sum%        cum   cum%

Maybe I am doing something wrong here?

guggero commented 1 year ago

That http://localhost:6060/debug/pprof/block needs to be turned on specifically, which is only available in the latest 0.17.0-rc4 version (probably not a good idea to upgrade to that just yet).

I think the output of /debug/pprof/goroutine?debug=2 should also help us already.

By the way, I have to ask about the disk storage again. You answered that both sync-freelist and db.bolt.nofreelistsync are true? That can't be, as they are the inverse of each other. Do you mean that both options are at their default values (meaning they are commented out in the config)?

What storage backend are you using? Some sort of network/cluster storage? What number of IOPS is allowed (if networked/provisioned storage)? I assume you're using bbolt and not Postgres as the storage backend for lnd?

Have you considered using a bitcoind instance per lnd instance? If changing the mempool size of bitcoind has an impact on how quickly lnd is getting locked up, maybe the bottleneck is actually the number of RPC threads on bitcoind.

Reckless-Satoshi commented 1 year ago

You answered that both sync-freelist and db.bolt.nofreelistsync are true? That can't be, as they are the inverse of each other. Do you mean that both options are at their default values (meaning they are commented out in the config)?

Sorry, I see now that I only replied that sync-freelist is true. In fact, we are using the default value for both, as these are not specified in our lnd.conf. I assume then that db.bolt.nofreelistsync is false.

What storage backend are you using? Some sort of network/cluster storage? What are the number IOPS allowed (if networked/provisioned storage)? I assume you're using bbolt and not Postgres as the storage backend for lnd?

It's a ZFS pool of SSDs in the same machine. The LND directory is simply mounted in the docker container. We are using bbolt.

Have you considered using a bitcoind instance per lnd instance? If changing the mempool size of bitcoind has an impact on how quickly lnd is getting locked up, maybe the bottleneck is actually the number of RPC threads on bitcoind.

We increased the number of bitcoind RPC threads. In retrospect, changing the mempool size does not seem to have had much of an impact, but it's hard to say. We have been running only one LND instance for the last two weeks and RPC still becomes unresponsive. The previous LND instance was monitoring over 12,000 on-chain addresses; the new one monitors just a few, and RPC still hangs.

I think the output of /debug/pprof/goroutine?debug=2 should also help us already.

I will collect this one the next time the node hangs.

Reckless-Satoshi commented 1 year ago

In the next few days we will build a new ZFS pool with new, faster drives and move only the LND directory to the new file system. This way we can finally confirm or rule out the disks as the issue.

Reckless-Satoshi commented 1 year ago

I think the output of /debug/pprof/goroutine?debug=2 should also help us already.

Here is the goroutine?debug=2 output while RPC is unresponsive. Splitting the pastebin into 4 parts as it exceeds the 512 KB limit: Part 1 https://pastebin.com/V49iDzyG Part 2 https://pastebin.com/ZqcTsTFn Part 3 https://pastebin.com/ciNTvARf Part 4 https://pastebin.com/nkQHiVcU

guggero commented 1 year ago

Thank you! By the way, GitHub also allows uploads of text files, so I'm re-posting the full, combined file here again: goroutinedump.txt

The full goroutine dump was very helpful! It looks like we have a deadlock in the HODL invoice system:

--> This goroutine is trying to send on the hodlChannel, which is buffered but probably already has a message in it. This is holding the invoice registry main lock.
goroutine 1667988 [select, 32 minutes]:
github.com/lightningnetwork/lnd/invoices.(*InvoiceRegistry).notifyHodlSubscribers(0xc0005d83c0, {0x22ca860, 0xc015d33e20})
    github.com/lightningnetwork/lnd/invoices/invoiceregistry.go:1714 +0x265
github.com/lightningnetwork/lnd/invoices.(*InvoiceRegistry).cancelInvoiceImpl(0xc0005d83c0, {0x4a, 0xe, 0x35, 0x49, 0xd6, 0x77, 0xec, 0xdc, 0x86, ...}, ...)
    github.com/lightningnetwork/lnd/invoices/invoiceregistry.go:1371 +0x585
github.com/lightningnetwork/lnd/invoices.(*InvoiceRegistry).CancelInvoice(...)
    github.com/lightningnetwork/lnd/invoices/invoiceregistry.go:1293
github.com/lightningnetwork/lnd/lnrpc/invoicesrpc.(*Server).CancelInvoice(0xc003f0db40, {0x1cbd1c0?, 0x1dc2da0?}, 0x0?)
    github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/invoices_server.go:309 +0xa5
.....

--> 108 goroutines are attempting to add invoices but are blocked on the main invoice registry lock:

goroutine 1690726 [sync.Mutex.Lock, 11 minutes]:
sync.runtime_SemacquireMutex(0x17ffffffffffffff?, 0x0?, 0x1f0000c017c0d050?)
    runtime/sema.go:77 +0x26
sync.(*Mutex).lockSlow(0xc0005d83c0)
    sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
    sync/mutex.go:90
sync.(*RWMutex).Lock(0xc017c0d0a8?)
    sync/rwmutex.go:147 +0x36
github.com/lightningnetwork/lnd/invoices.(*InvoiceRegistry).AddInvoice(0xc0005d83c0, 0xc0253b6c40, {0x4a, 0x4f, 0x96, 0x4a, 0x90, 0x63, 0x96, 0xf8, ...})
    github.com/lightningnetwork/lnd/invoices/invoiceregistry.go:584 +0x3d
github.com/lightningnetwork/lnd/lnrpc/invoicesrpc.AddInvoice({0x0?, 0x0?}, 0xc017c0d538, 0xc017c0d588)
    github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/addinvoice.go:481 +0x1506
github.com/lightningnetwork/lnd/lnrpc/invoicesrpc.(*Server).AddHoldInvoice(0xc003f0db40, {0x22e0938, 0xc01060e3f0}, 0xc028838780)
    github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/invoices_server.go:367 +0x36c

I'll see if I can come up with a fix. If there's any way for you to temporarily stop using HODL invoices then that should unclog the node. But I assume that's a main part of your application... So hopefully we can get a patch out soon.
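For readers following along, here is a minimal Go sketch of the pattern described above (illustration only, not the actual lnd code; the names are invented): one goroutine blocks sending on a full buffered channel while holding the registry mutex, so every other RPC call that needs the same mutex stalls behind it.

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var registryMu sync.Mutex
    hodlEvents := make(chan string, 1) // buffered, but only capacity 1
    hodlEvents <- "previous event"     // buffer is already full

    // "CancelInvoice": takes the lock, then blocks on the channel send
    // because nothing is draining hodlEvents.
    go func() {
        registryMu.Lock()
        defer registryMu.Unlock()
        hodlEvents <- "cancel event" // blocks forever
    }()

    time.Sleep(100 * time.Millisecond)

    // "AddInvoice" (and the 100+ goroutines like it in the dump): it can
    // never acquire the lock, so the RPC call hangs.
    done := make(chan struct{})
    go func() {
        registryMu.Lock()
        registryMu.Unlock()
        close(done)
    }()

    select {
    case <-done:
        fmt.Println("AddInvoice got the lock")
    case <-time.After(time.Second):
        fmt.Println("AddInvoice is stuck on the registry lock (simulated deadlock)")
    }
}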

Roasbeef commented 1 year ago

@Reckless-Satoshi worth noting that for the blocking/mutex profiles, you only want to enable them when you want a profile, then turn them off. Otherwise they add extra load, since they continue to collect samples, unlike the other profiling options.
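In plain Go terms, this is the runtime API underneath (shown for illustration; it is not the lnd config surface): sampling stays on once the rate is set, which is why you enable it only around the capture window.

package main

import "runtime"

func main() {
    // Turn sampling on only around the capture window...
    runtime.SetBlockProfileRate(1)     // record every blocking event
    runtime.SetMutexProfileFraction(1) // record every mutex contention event

    // ...fetch /debug/pprof/block and /debug/pprof/mutex here...

    // ...then turn it back off, since sampling keeps running otherwise.
    runtime.SetBlockProfileRate(0)     // rate <= 0 disables block profiling
    runtime.SetMutexProfileFraction(0) // fraction 0 disables mutex profiling
}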

Roasbeef commented 1 year ago

@Reckless-Satoshi thanks for that last trace! That was exactly what we needed to get to the bottom of this. I think we have the root cause, and are discussing the best route to resolve the issue (have a few candidates of varying diff size).

Roasbeef commented 1 year ago

Summarized my investigation and current conclusions here: https://github.com/lightningnetwork/lnd/issues/8023.

Roasbeef commented 1 year ago

I have a candidate fix here; it applies on top of master: https://github.com/lightningnetwork/lnd/pull/8024