Reckless-Satoshi closed this issue 1 year ago
Some more context, maybe useful for debugging. The RoboSats node is Tor-only. The Tor daemon runs as another container in the orchestration; possibly these are the 127.0.0.1:***** broken pipes. However, all containers can talk to each other successfully. Even the LND container does: after every LND container kill and restart, it connects to some peers and gets synced to chain and graph. After some minutes, the broken pipes start showing.
gRPC requests from other containers work as usual. For example, one can create a new order in Robo and the invoice is created.
There is no RAM or disk limitations.
On the same machine, with identical orchestration, there is a full testnet replica. It is working as usual.
Hi, thanks for reporting the issue. Did this behavior occur the first time after you adjusted settings or did it start without intervention? Would you be able to provide lnd logs that contain the start up as well as showing the decay of active peers? Can also mail the logs if you don't want to attach them here.
> Did this behavior occur the first time after you adjusted settings or did it start without intervention?
No intervention. No one logged into the machine for a day prior to the issue, no setting was changed for at least +2 months.
> Would you be able to provide lnd logs that contain the start up as well as showing the decay of active peers? Can also mail the logs if you don't want to attach them here.
Collecting the logs. Would be great if I can send them by email, where can I send them to?
hieblmi at lightning dot engineering
Sent logs of the latest restart. It connected to +12 channels. Currently only 7 show as active. "synced_to_chain" is false now.
Is the wallet ever unlocked, can't see that from the logs.
> Is the wallet ever unlocked, can't see that from the logs.
Yes it is. Is the log file ~1.68 MB?
Edit: I believe the log level is set to WARN, will rerun in DEBUG.
Sent new logs with Debug verbosity. This time it seems something did fail before any peer was connected.
Edit: sent another run where it connected to 2 channels before chain sync went false.
Can other services connect to the Tor daemon? If you restart the Tor daemon is the issue resolved? Looks like it isn't able to maintain SOCKs proxy connections.
If you check the logs of the tor daemon, do you see any relevant error logging?
If you used a normal CLI command to establish a TCP connection using the SOCKS proxy, does that work?
Hey thank you @Roasbeef
> Can other services connect to the Tor daemon? If you restart the Tor daemon is the issue resolved? Looks like it isn't able to maintain SOCKs proxy connections.
Every other service can keep connection to the Tor Daemon. Restarting the Tor Daemon, or the whole machine, does not help. But restarting only the LND container will grant at least those initial 5-10 minutes without connection issues where it gets to sync the graph and establish connection to a few peers.
> If you check the logs of the tor daemon, do you see any relevant error logging?
It's hard to say, as I am not accustomed to looking at Tor daemon logs. There is certainly a huge amount of logging, some of which looks like timeouts.
> If you used a normal CLI command to establish a TCP connection using the SOCKS proxy, does that work?
From inside the LND container I run `curl --socks5-hostname 127.0.0.1:9050 https://example.com`, and it prints example.com's content correctly. It does so even when LND is already incapable of connecting to peers.
I also tried deleting `./lnd/v3_onion_private_key`. At the next startup it created a new one. After some minutes, it failed to connect to all peers, so deleting `v3_onion_private_key` did not help.
What does your Tor related config for lnd look like?
`write: broken pipe`

In the Go runtime, this means that the other side has torn down or reset the TCP connection, but a program/goroutine then tries to write to it.
Also, are you able to manually use `lncli connect` to create outbound connections, either to other Tor nodes or clearnet nodes?
Has this been going on for some time? Or did some recent update or config change trigger the behavior?
Good news and bad news. The good news: the node managed to connect to all channels. The bad news: it was not thanks to anything I tried; it simply happened overnight and I do not know why it works now. I have restarted the node 3 times, and every single time it connects to all peers.
However, I notice that after ~37 peers, it pauses. There are many `broken pipe` errors; after 1-2 minutes, it resumes and connects to all peers successfully. I can send logs of these runs if needed.
A wild guess: does LND connect to all peers at once, asynchronously and in no specific order? Or is there a deterministic order or batches (e.g. 10 peers at a time), where it keeps retrying if the full batch is unreachable?
> What does your Tor related config for lnd look like?
This is all there is in `lnd.conf` relating to Tor:

```
tor.active=1
tor.v3=1
```
> Also are you able to manually use `lncli connect` to create outbound connections, either to other tor nodes or clearnet nodes?
Unfortunately I cannot test now given that the node is back and running as usual. Running `lncli connect` does work now, but I do not know if it would have worked yesterday.
> Has this been going on for some time? Or did some recent update or config change trigger the behavior?
This is the first time I observe this behavior. No recent change of settings or updates; in fact, I happened to be on holiday when it started to fail :`)
What I observe now is a few outgoing `in_flight` payments that were stuck when the issue started and are still `in_flight`, yet the first-hop channels on which these are flying show 0 total HTLCs (outgoing/incoming) in node managers like Thunderhub. I must say I had observed this particular thing happen 2 times in the last 8 months: payments that get stuck `in_flight` forever, even weeks after the first-hop channel was collaboratively closed. This is probably better suited for another issue; however, there are quite a few of the same stuck payments at the moment.
It happened again and I have some more information on this case.
`lncli connect <pubkey@host:port>` fails with `rpc error: code unknown desc = write tcp 127.0.0.1:.....->127.0.0.1:9050: write: broken pipe`.
`lncli addinvoice` never returns. There is some extra context that might be of interest:
The LND node of RoboSats has previously had other issues that I originally thought were unrelated to this one. For example, it sometimes cannot create new invoices (gRPC AddInvoice or `lncli addinvoice` would never return). The first time this happened was ~6-8 months ago. Yet a simple restart of the LND container fixed it instantly. This issue started happening a bit more often lately, maybe every 2-5 days. Sometimes even a few times in a day, but then it might go 2-3 weeks without any occurrence.
I did not observe an obvious relationship between the number of pending HTLCs or invoices and the likelihood of LND getting stuck (but there might be a relation!?). I did not report these earlier as I never saw anything out of the ordinary in the logs (now I realize the verbosity was set to WARN); it simply stops replying to some RPC calls (e.g. `addinvoice`, `newaddress`, `settleinvoice`), but it does not stop replying to `lncli getinfo`, `lncli lookupinvoice`, or `listinvoices`.
Again, the fact that simply restarting the LND container seems to temporarily fix the `broken pipe` errors and the unresponsive gRPC calls leads me to believe there is something that LND is struggling with.
LND might typically take 4GB of RAM. The machine is not disk, RAM or CPU limited.
Edit: I can confirm my issue looks very similar to #7924, including some local `lncli` calls getting stuck.
I can confirm that even if it connects to all peers/channels, `lncli addinvoice` can still never return. So these 2 issues can happen independently or jointly.
> A wild guess, does LND connect to all peers at once asynchronously in no specific order? or is there a deterministic order or batches (e.g. 10 peers at a time) and if the full batch is unreachable it keeps retrying?
The order is random-ish, likely derived from node pubkey ordering.
> However, I notice that after ~37 peers, it pauses. There are many broken pipes after 1-2 minutes, it keeps connecting to all peers successfully. I can send logs of these runs if needed.
Sounds like it maybe runs into some tor fd-like limit? I have a tor node myself with ~50 channels/peers, but haven't been able to repro anything like this. I wonder if it's some tor thing where it starts to shed connections that are lower in the priority re QoS ranking.
> lncli addinvoice can still never return. So these 2 issues can happen independently or jointly.
If this happens again, are you able to grab a profile? What we want here is a snapshot of all the active goroutines, then we get all the stack traces and can see what is actually blocking.
> LND might typically take 4GB of RAM. The machine is not disk, RAM or CPU limited.
That sounds rather high... particularly for the 30-ish peer range. Alongside the goroutine trace, can you also grab a memory profile? This won't reveal any sensitive information about the node (just what is allocating all that memory).
> Sounds like it maybe runs into some tor fd-like limit? I have a tor node myself with ~50 channels/peers, but haven't been able to repro anything like this. I wonder if it's some tor thing where it starts to shed connections that are lower in the priority re QoS ranking.
The node has been running successfully for over 1 year with +100 channels/peers. Currently half of all channels have been closed; it is now down to ~70 channels, with no improvement with respect to this issue.
> If this happens again, are you able to grab a profile? What we want here is a snapshot of all the active goroutines, then we get all the stack traces and can see what is actually blocking.
It happens very frequently. To keep the node up I am restarting it 15-30 times a day. After each restart, there is a chance of some stability before `lncli addinvoice` stops working. However, the fact that it might not connect to all peers on restart makes managing this node a real challenge.
Profile taken after ~30 mins running. Connected to all peers, but eventually many `lncli` commands were not returning: https://pastebin.com/yeeBmHpB

Memory: https://pastebin.com/ReSjfepE
At the moment there might be ~10-14 outgoing payments that are `in_flight` but whose `timelock` block height is already in the past. How can we deal with these? Are these safe to ignore and retry payment?
In my view the issues (unresponsive lncli commands, no connection to peers, and stuck in_flight payments) are likely to have a common cause. Given that some of these issues have shown up before but have only gotten more severe over time, it seems like some sort of "node aging". What is recommended here from an operational standpoint? Should we simply create a new node and ditch the current one altogether, or is there any chance this issue can be solved?
@Reckless-Satoshi can you move that to a paste bin so it's easier to scroll down thru the comments here? Thx!
Some sections that pop out:
14 @ 0x43a236 0x44b5af 0x44b586 0x468ee6 0x14bb979 0x14bb955 0x1618005 0x171ddc4 0x171943e 0x171658d 0x46d261
# 0x468ee5 sync.runtime_SemacquireRWMutexR+0x25 runtime/sema.go:82
# 0x14bb978 sync.(*RWMutex).RLock+0xf8 sync/rwmutex.go:71
# 0x14bb954 github.com/lightningnetwork/lnd/channeldb.(*ChannelGraph).HasChannelEdge+0xd4 github.com/lightningnetwork/lnd/channeldb/graph.go:1128
# 0x1618004 github.com/lightningnetwork/lnd/routing.(*ChannelRouter).IsStaleEdgePolicy+0x64 github.com/lightningnetwork/lnd/routing/router.go:2658
# 0x171ddc3 github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).handleChanUpdate+0x423 github.com/lightningnetwork/lnd/discovery/gossiper.go:2702
# 0x171943d github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).processNetworkAnnouncement+0xfd github.com/lightningnetwork/lnd/discovery/gossiper.go:2007
# 0x171658c github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).handleNetworkMessages+0x22c github.com/lightningnetwork/lnd/discovery/gossiper.go:1503
4 @ 0x43a236 0x44a4be 0x14825ea 0x14858df 0x1880358 0x1889fcc 0x1888718 0x1057967 0x105491a 0x185d9d1 0x105491a 0x185d04b 0x105491a 0x185d60a 0x105491a 0x185c858 0x105491a 0x10547be 0x18885d8 0xcaf84f 0xcb3f36 0xcad3b8 0x46d261
# 0x14825e9 github.com/lightningnetwork/lnd/invoices.(*InvoiceExpiryWatcher).AddInvoices+0x89 github.com/lightningnetwork/lnd/invoices/invoice_expiry_watcher.go:265
# 0x14858de github.com/lightningnetwork/lnd/invoices.(*InvoiceRegistry).AddInvoice+0x31e github.com/lightningnetwork/lnd/invoices/invoiceregistry.go:605
# 0x1880357 github.com/lightningnetwork/lnd/lnrpc/invoicesrpc.AddInvoice+0x1477 github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/addinvoice.go:474
# 0x1889fcb github.com/lightningnetwork/lnd/lnrpc/invoicesrpc.(*Server).AddHoldInvoice+0x36b github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/invoices_server.go:367
# 0x1888717 github.com/lightningnetwork/lnd/lnrpc/invoicesrpc._Invoices_AddHoldInvoice_Handler.func1+0x77 github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/invoices_grpc.pb.go:225
# 0x1057966 github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1+0x86 github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107
# 0x1054919 github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39 github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25
# 0x185d9d0 github.com/lightningnetwork/lnd/rpcperms.(*InterceptorChain).middlewareUnaryServerInterceptor.func1+0x110 github.com/lightningnetwork/lnd/rpcperms/interceptor.go:832
# 0x1054919 github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39 github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25
# 0x185d04a github.com/lightningnetwork/lnd/rpcperms.(*InterceptorChain).MacaroonUnaryServerInterceptor.func1+0x8a github.com/lightningnetwork/lnd/rpcperms/interceptor.go:689
# 0x1054919 github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39 github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25
# 0x185d609 github.com/lightningnetwork/lnd/rpcperms.(*InterceptorChain).rpcStateUnaryServerInterceptor.func1+0x109 github.com/lightningnetwork/lnd/rpcperms/interceptor.go:781
# 0x1054919 github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39 github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25
# 0x185c857 github.com/lightningnetwork/lnd/rpcperms.errorLogUnaryServerInterceptor.func1+0x57 github.com/lightningnetwork/lnd/rpcperms/interceptor.go:605
# 0x1054919 github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1+0x39 github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25
# 0x10547bd github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1+0xbd github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34
# 0x18885d7 github.com/lightningnetwork/lnd/lnrpc/invoicesrpc._Invoices_AddHoldInvoice_Handler+0x137 github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/invoices_grpc.pb.go:227
# 0xcaf84e google.golang.org/grpc.(*Server).processUnaryRPC+0xcee google.golang.org/grpc@v1.41.0/server.go:1279
# 0xcb3f35 google.golang.org/grpc.(*Server).handleStream+0xa35 google.golang.org/grpc@v1.41.0/server.go:1608
# 0xcad3b7 google.golang.org/grpc.(*Server).serveStreams.func1.2+0x97
Looks like something is trying to grab a lock in the DB, but then that blocks. Thanks, this was super helpful; we'll look into it.
> can you move that to a paste bin
Done, sorry about that! Thank you guys for the help!
Those stack traces were super helpful! We have a candidate fix here: #7928
The PR seems to patch a bug introduced post v0.16.4 release. We are running into this issue on v0.16.4, but was also observed in v0.16.3 and prior.
Does it make sense to apply #7938 over the v0.16.4 tag and test?
@Reckless-Satoshi yes if you're able to do so, we also plan to cut an rc2, as several other users reported that the issue was solved by the PR.
Running rc2. Can we keep this issue open until it is confirmed solved? I will report all findings later, but on a quick glance commands like `lncli addinvoice` take very long to reply (+30 seconds), so I suspect they might still get fully stuck after some time.
> but on a quick glance commands like lncli addinvoice take very long to reply (+30 seconds) so I suspect they might still get fully stuck after some time.
Can you get a CPU profile of this?
Reporting some findings here after 24h running v0.17.0-rc2. The main two issues discussed in this thread persist: (1) connection to all peers is randomly lost, or never achieved at all at startup, and (2) unresponsive RPC for some calls (`addinvoice`, `settleinvoice`, etc.).
Here (Pastebin) is a profile taken after 12 hours running perfectly (but sluggish) and then RPC became unresponsive. I sent logs to @hieblmi as well.
I also sent logs of a node startup where the node never achieved connecting to all peers, but got stuck at ~12 with `broken pipe` errors, similar to the ones I sent when opening this thread. I have observed it can still eventually connect to all peers several hours later. If this is a Tor issue, I am unsure why it is surfacing now, as nothing has changed in the configuration.
> Can you get a CPU profile of this?

I am not familiar with Go; how can I take a CPU profile of LND?
The node still has a lot of `in_flight` outgoing payments that have existed for a long time. How do we deal with these? Any way to close them? Are they safe to ignore and re-attempt? Do they consume resources?
> I am not familiar with Go, how can I take a CPU profile of LND?
You do it the same way you get a goroutine dump, but you use `/debug/pprof/profile?seconds=30` (which will take a snapshot over 30 seconds).
> You do it the same way you get a goroutine dump, but you use `/debug/pprof/profile?seconds=30` (which will take a snapshot of 30 seconds).
Here is the CPU profile while `addinvoice` is unresponsive: https://pastebin.com/0NznDKYE
Another piece of data that might be of interest. We have added another LND docker service on the same docker-compose orchestration. It is a perfect clone: same `lnd.conf`, same Tor daemon, same bitcoind, etc. This node does not seem to have the same problems: it starts up quickly with ease, it can connect to all peers (no `broken pipe` errors, even at the same time the other instance can't connect to peers), and it seems to generate invoices timely. I will report if this node starts to show some of the same symptoms.
@Reckless-Satoshi can you get that goroutine profile again once the node has been running for longer? At that profile, it looks like you only have 5 or so peers there.
In terms of other configuration changes, you can also look into:

- Increasing the `bitcoind.txpollinginterval` value; you can raise this to something like `5m`.
- Increasing the number of RPC threads that bitcoind uses.
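In `lnd.conf` terms, the polling suggestion would look something like the fragment below (section name as in lnd's sample config; `5m` is just the value suggested above):

```ini
[Bitcoind]
; Poll bitcoind for transaction/mempool updates every 5 minutes instead
; of the much more frequent default, reducing RPC load on bitcoind.
bitcoind.txpollinginterval=5m
```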
The issue (hung `lncli addinvoice`) occurred twice again today on the new node, after two days working well since we replaced it. Here is the CPU profile after it happened while all channels were `active`: https://pastebin.com/a4KuKXd9
> Increasing the `bitcoind.txpollinginterval` value, you can raise this to something like `5m`
> Look into increasing the number of rpc threads that bitcoind uses.
On this new node, in addition to the fixes proposed by @roasbeef in the last comment, we disabled `rpc-middleware` and stopped running LIT and Loop to rule out any cause related to these tools. We also increased `ulimit -n` considerably on the machine and in the LND container. The issue still happens.
Active goroutines after ~2h running, all peers connected and gRPC functional (but very sluggish): https://pastebin.com/hV6QKSLK
Thanks for the CPU profile and goroutine dump. Could you please take another CPU profile that covers a longer time span? Perhaps 5 minutes? Just change `seconds=30` to `seconds=600`.
Also, a couple of additional questions:

- What is the `maxmempool` setting of the `bitcoind` node?
- What is the `numgraphsyncpeers` setting of the `lnd` nodes?
- What is the `sync-freelist` and `db.bolt.nofreelistsync` setting of the `lnd` nodes?
- Do you see any errors/warnings in the `bitcoind` log mentioning `http work queue depth exceeded`?

Thanks a lot!
> What is the `maxmempool` setting of the `bitcoind` node?

`maxmempool=6000` and `mempoolexpiry=700`. The reason for the large mempool was to avoid cases where outgoing on-chain TXs are forgotten. I see how a `maxmempool` that big might have an impact. There might also be a correlation between LND issues and the mempool getting larger. So after your message, we changed the values to `maxmempool=768` and `mempoolexpiry=400`.
With these values, RPC still becomes unresponsive (but it sort of seems more stable? Hard to say; sometimes it lasts for a couple of days before getting stuck). If we turn on the old LND node (same orchestration), it is guaranteed that the new node's RPC will become unresponsive and its connection to peers will drop. It is in fact now impossible to bring up the older node: it seems to never finish connecting to peers, and it crashes the newer LND node.
> What is the `numgraphsyncpeers` setting of the `lnd` nodes?

9 (default)
> What is the `sync-freelist` and `db.bolt.nofreelistsync` setting of the `lnd` nodes?

true (default)
> Do you see any errors/warnings in the `bitcoind` log mentioning `http work queue depth exceeded`?

No hits for "queue" other than `[http] creating work queue of depth 16`.
I have also inspected the disk (errors, latency, read/write speed), everything seems to be alright.
We have increased `min_htlc` to 1000 sats to avoid small held payment parts. I've observed ~200 pending HTLCs when RPC becomes unresponsive, but it also becomes unresponsive with smaller numbers. Not sure if there is any correlation.
> Perhaps 5 minutes? Just change `seconds=30` to `seconds=600`

5-minute CPU profile on the new node once RPC became unresponsive after ~3 days running: https://pastebin.com/YxVSeRDW
10-minute CPU profile on the new node once RPC became unresponsive a few minutes after restarting LND: https://pastebin.com/VJ6SkmXk
On the new node I have also observed, twice, that `lncli getinfo` becomes unresponsive. Restarting always resolves the unresponsive RPC, yet that means the node spends a large part of the day doing restarts.
Can you generate the `svg` version of those CPU profiles? It's much easier to interpret that way.
Looking at the prior goroutine dump, it looks like the goroutines are blocked on acquiring locks in the `ChannelGraph`, specifically in the methods `HasChannelEdge` and `UpdateEdgePolicy`. This leads me to believe there's an issue in `batch/scheduler.go`.
Just to confirm, @Reckless-Satoshi you're not running any additional patches on top?
Digging a bit deeper, it looks like a few goroutines at the very bottom are actually blocked on opening a DB transaction:
1 @ 0x43a236 0x44b5af 0x44b586 0x468e86 0x489b65 0xa3a9f7 0xa3a9e2 0xa3a67d 0xa4d2bd 0xa4e074 0xa4e065 0x14a9a6b 0x14a99a1 0x489f02 0x14aa772 0x14aa742 0x46d261
# 0x468e85 sync.runtime_SemacquireMutex+0x25 runtime/sema.go:77
# 0x489b64 sync.(*Mutex).lockSlow+0x164 sync/mutex.go:171
# 0xa3a9f6 sync.(*Mutex).Lock+0x76 sync/mutex.go:90
# 0xa3a9e1 go.etcd.io/bbolt.(*DB).beginRWTx+0x61 go.etcd.io/bbolt@v1.3.7/db.go:755
# 0xa3a67c go.etcd.io/bbolt.(*DB).Begin+0x1c go.etcd.io/bbolt@v1.3.7/db.go:697
# 0xa4d2bc github.com/btcsuite/btcwallet/walletdb/bdb.(*db).beginTx+0x1c github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:323
# 0xa4e073 github.com/btcsuite/btcwallet/walletdb/bdb.(*db).BeginReadWriteTx+0x53 github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:335
# 0xa4e064 github.com/btcsuite/btcwallet/walletdb/bdb.(*db).Update+0x44 github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:417
# 0x14a9a6a github.com/lightningnetwork/lnd/kvdb.Update+0x1ea github.com/lightningnetwork/lnd/kvdb@v1.4.2/interface.go:16
# 0x14a99a0 github.com/lightningnetwork/lnd/batch.(*batch).run+0x120 github.com/lightningnetwork/lnd/batch/batch.go:54
# 0x489f01 sync.(*Once).doSlow+0xc1 sync/once.go:74
# 0x14aa771 sync.(*Once).Do+0x51 sync/once.go:65
# 0x14aa741 github.com/lightningnetwork/lnd/batch.(*batch).trigger+0x21 github.com/lightningnetwork/lnd/batch/batch.go:32
1 @ 0x43a236 0x44b5af 0x44b586 0x468e86 0x489b65 0xa3a9f7 0xa3a9e2 0xa3a67d 0xa4d2bd 0xa4e074 0xa4e065 0x14e715f 0x14e70ce 0x156c436 0x1611c53 0x1611a65 0x1610d29 0x160cb76 0x46d261
# 0x468e85 sync.runtime_SemacquireMutex+0x25 runtime/sema.go:77
# 0x489b64 sync.(*Mutex).lockSlow+0x164 sync/mutex.go:171
# 0xa3a9f6 sync.(*Mutex).Lock+0x76 sync/mutex.go:90
# 0xa3a9e1 go.etcd.io/bbolt.(*DB).beginRWTx+0x61 go.etcd.io/bbolt@v1.3.7/db.go:755
# 0xa3a67c go.etcd.io/bbolt.(*DB).Begin+0x1c go.etcd.io/bbolt@v1.3.7/db.go:697
# 0xa4d2bc github.com/btcsuite/btcwallet/walletdb/bdb.(*db).beginTx+0x1c github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:323
# 0xa4e073 github.com/btcsuite/btcwallet/walletdb/bdb.(*db).BeginReadWriteTx+0x53 github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:335
# 0xa4e064 github.com/btcsuite/btcwallet/walletdb/bdb.(*db).Update+0x44 github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:417
# 0x14e715e github.com/lightningnetwork/lnd/kvdb.Update+0x13e github.com/lightningnetwork/lnd/kvdb@v1.4.2/interface.go:16
# 0x14e70cd github.com/lightningnetwork/lnd/channeldb.(*OpenChannel).AppendRemoteCommitChain+0xad github.com/lightningnetwork/lnd/channeldb/channel.go:2553
# 0x156c435 github.com/lightningnetwork/lnd/lnwallet.(*LightningChannel).SignNextCommitment+0x1055 github.com/lightningnetwork/lnd/lnwallet/channel.go:4157
# 0x1611c52 github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).updateCommitTx+0x72 github.com/lightningnetwork/lnd/htlcswitch/link.go:2253
# 0x1611a64 github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).updateCommitTxOrFail+0x24 github.com/lightningnetwork/lnd/htlcswitch/link.go:2204
# 0x1610d28 github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).handleUpstreamMsg+0x1248 github.com/lightningnetwork/lnd/htlcswitch/link.go:2094
# 0x160cb75 github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).htlcManager+0x11f5 github.com/lightningnetwork/lnd/htlcswitch/link.go:1230
This also looks somewhat suspect; here we see the goroutine blocked on an `fsync` call:
1 @ 0x4c55c7 0x4c2d6a 0xa334ed 0xa45f57 0xa44a9a 0xa4b1a5 0xa4e13c 0x14ea1cd 0x14ea0ca 0x160b145 0x160b0fa 0x160afd7 0x160bd85 0x46d261
# 0x4c55c6 syscall.Syscall+0x26 syscall/syscall_linux.go:69
# 0x4c2d69 syscall.Fdatasync+0x29 syscall/zsyscall_linux_amd64.go:418
# 0xa334ec go.etcd.io/bbolt.fdatasync+0x4c go.etcd.io/bbolt@v1.3.7/bolt_linux.go:9
# 0xa45f56 go.etcd.io/bbolt.(*Tx).write+0x236 go.etcd.io/bbolt@v1.3.7/tx.go:473
# 0xa44a99 go.etcd.io/bbolt.(*Tx).Commit+0x239 go.etcd.io/bbolt@v1.3.7/tx.go:196
# 0xa4b1a4 github.com/btcsuite/btcwallet/walletdb/bdb.(*transaction).Commit+0x24 github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:101
# 0xa4e13b github.com/btcsuite/btcwallet/walletdb/bdb.(*db).Update+0x11b github.com/btcsuite/btcwallet/walletdb@v1.4.0/bdb/db.go:437
# 0x14ea1cc github.com/lightningnetwork/lnd/kvdb.Update+0x12c github.com/lightningnetwork/lnd/kvdb@v1.4.2/interface.go:16
# 0x14ea0c9 github.com/lightningnetwork/lnd/channeldb.(*OpenChannel).RemoveFwdPkgs+0x29 github.com/lightningnetwork/lnd/channeldb/channel.go:3045
# 0x160b144 github.com/lightningnetwork/lnd/lnwallet.(*LightningChannel).RemoveFwdPkgs+0x104 github.com/lightningnetwork/lnd/lnwallet/channel.go:5657
# 0x160b0f9 github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).resolveFwdPkg+0xb9 github.com/lightningnetwork/lnd/htlcswitch/link.go:826
# 0x160afd6 github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).resolveFwdPkgs+0xf6 github.com/lightningnetwork/lnd/htlcswitch/link.go:803
# 0x160bd84 github.com/lightningnetwork/lnd/htlcswitch.(*channelLink).htlcManager+0x404 github.com/lightningnetwork/lnd/htlcswitch/link.go:1084
I've been polling my nodes for goroutine dumps for a few minutes, and never see an explicit block on `fdatasync` like that. Looking at all the goroutine dumps you've sent so far, each of them has this blocking goroutine at the very bottom of the stack trace. If the syscall is blocking indefinitely for whatever reason, then just about everything else stops, as the syscall is made in the DB while the main mutex is held.
@Reckless-Satoshi can you try to obtain a blocking profile? This can help us determine whether the syscall is the issue or not, or generally where things are blocked on I/O: `go tool pprof http://localhost:6060/debug/pprof/block`
I will take this one the next time the RPC becomes unresponsive.
Here the 10 min CPU profile as SVG: https://user-images.githubusercontent.com/90936742/267646868-b95dea24-c627-4ce6-a9d7-804df1ba1652.svg
Hi @Reckless-Satoshi Will the old RoboSats node come online once again or is it dead now? Would like to close the channel in a coop close if possible. Thanks.
> Will the old RoboSats node come online once again or is it dead now? Would like to close the channel in a coop close if possible.
If we manage to bring it back online we will initiate collaborative closes for every channel. We'll try again in the following days.
@Roasbeef I have tried `docker exec ... curl http://localhost:6060/debug/pprof/block?seconds=600 --output /save/path/block` and `go tool pprof -top /save/path/block` several times, in a similar fashion to the previous profiles I took; however, I always retrieve an empty set. I always tried while the RPC was unresponsive.
```
File: lnd
Type: delay
Time: Sep 17, 2023 at 6:21am (UTC)
Duration: 1mins, Total samples = 0
Showing nodes accounting for 0, 0% of 0 total
      flat  flat%   sum%        cum   cum%
```
Maybe I am doing something wrong here?
That `http://localhost:6060/debug/pprof/block` endpoint needs to be turned on specifically, which is only available in the latest 0.17.0-rc4 version (probably not a good idea to upgrade to that just yet).
I think the output of `/debug/pprof/goroutine?debug=2` should also help us already.
By the way, I have to ask about the disk storage again.
You answered that both `sync-freelist` and `db.bolt.nofreelistsync` are true? That can't be, as they are the inverse of each other. Do you mean that both options are at their default values (meaning they are commented out in the config)?

What storage backend are you using? Some sort of network/cluster storage? What is the number of IOPS allowed (if networked/provisioned storage)? I assume you're using bbolt and not Postgres as the storage backend for `lnd`?
Have you considered using a bitcoind instance per `lnd` instance? If changing the mempool size of `bitcoind` has an impact on how quickly `lnd` is getting locked up, maybe the bottleneck is actually the number of RPC threads on `bitcoind`.
> You answered that both `sync-freelist` and `db.bolt.nofreelistsync` is true? That can't be, as they are the inverse of each other. Do you mean that both options are at their default values (meaning they are commented out in the config)?
Sorry, I see now I only replied that `sync-freelist` is `true`. In fact, we are using the default value for both, as these are not specified in our `lnd.conf`. I assume then `db.bolt.nofreelistsync` is `false`.
> What storage backend are you using? Some sort of network/cluster storage? What are the number IOPS allowed (if networked/provisioned storage)? I assume you're using bbolt and not Postgres as the storage backend for lnd?
It's a ZFS pool of SSDs in the same machine. The LND directory is simply mounted in the docker container. We are using bbolt.
> Have you considered using a bitcoind instance per lnd instance? If changing the mempool size of bitcoind has an impact on how quickly lnd is getting locked up, maybe the bottleneck is actually the number of RPC threads on bitcoind.
We increased the number of bitcoind RPC threads. In retrospect, changing the mempool size does not seem to have had much of an impact, but it's hard to say. We have been running only one LND instance for the last two weeks and RPC still becomes unresponsive. The previous LND instance was monitoring over 12000 on-chain addresses, the new one just a few, and RPC still hangs.
I think the output of /debug/pprof/goroutine?debug=2 should also help us already.
I will collect this one the next time the node hangs.
In the next few days we will build a new ZFS pool with new and faster drives and move only the LND directory to the new file system. This way we can finally confirm or rule out the disks as the issue.
I think the output of /debug/pprof/goroutine?debug=2 should also help us already.
Here is the goroutine?debug=2 output while RPC is unresponsive. Splitting the pastebin into 4 parts as it exceeds the 512 KB limit:
Part 1 https://pastebin.com/V49iDzyG
Part 2 https://pastebin.com/ZqcTsTFn
Part 3 https://pastebin.com/ciNTvARf
Part 4 https://pastebin.com/nkQHiVcU
Thank you! By the way, GitHub also allows uploads of text files, so I'm re-posting the full, combined file here again: goroutinedump.txt
The full goroutine dump was very helpful! It looks like we have a deadlock in the HODL invoice system:
--> This goroutine is trying to send on the hodlChannel, which is buffered but probably already has a message in it. This is holding the invoice registry main lock.
goroutine 1667988 [select, 32 minutes]:
github.com/lightningnetwork/lnd/invoices.(*InvoiceRegistry).notifyHodlSubscribers(0xc0005d83c0, {0x22ca860, 0xc015d33e20})
github.com/lightningnetwork/lnd/invoices/invoiceregistry.go:1714 +0x265
github.com/lightningnetwork/lnd/invoices.(*InvoiceRegistry).cancelInvoiceImpl(0xc0005d83c0, {0x4a, 0xe, 0x35, 0x49, 0xd6, 0x77, 0xec, 0xdc, 0x86, ...}, ...)
github.com/lightningnetwork/lnd/invoices/invoiceregistry.go:1371 +0x585
github.com/lightningnetwork/lnd/invoices.(*InvoiceRegistry).CancelInvoice(...)
github.com/lightningnetwork/lnd/invoices/invoiceregistry.go:1293
github.com/lightningnetwork/lnd/lnrpc/invoicesrpc.(*Server).CancelInvoice(0xc003f0db40, {0x1cbd1c0?, 0x1dc2da0?}, 0x0?)
github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/invoices_server.go:309 +0xa5
.....
--> 108 goroutines are attempting to add invoices but are blocked on the main invoice registry lock:
goroutine 1690726 [sync.Mutex.Lock, 11 minutes]:
sync.runtime_SemacquireMutex(0x17ffffffffffffff?, 0x0?, 0x1f0000c017c0d050?)
runtime/sema.go:77 +0x26
sync.(*Mutex).lockSlow(0xc0005d83c0)
sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
sync/mutex.go:90
sync.(*RWMutex).Lock(0xc017c0d0a8?)
sync/rwmutex.go:147 +0x36
github.com/lightningnetwork/lnd/invoices.(*InvoiceRegistry).AddInvoice(0xc0005d83c0, 0xc0253b6c40, {0x4a, 0x4f, 0x96, 0x4a, 0x90, 0x63, 0x96, 0xf8, ...})
github.com/lightningnetwork/lnd/invoices/invoiceregistry.go:584 +0x3d
github.com/lightningnetwork/lnd/lnrpc/invoicesrpc.AddInvoice({0x0?, 0x0?}, 0xc017c0d538, 0xc017c0d588)
github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/addinvoice.go:481 +0x1506
github.com/lightningnetwork/lnd/lnrpc/invoicesrpc.(*Server).AddHoldInvoice(0xc003f0db40, {0x22e0938, 0xc01060e3f0}, 0xc028838780)
github.com/lightningnetwork/lnd/lnrpc/invoicesrpc/invoices_server.go:367 +0x36c
I'll see if I can come up with a fix. If there's any way for you to temporarily stop using HODL invoices then that should unclog the node. But I assume that's a main part of your application... So hopefully we can get a patch out soon.
@Reckless-Satoshi worth noting that for the blocking/mutex profiles, you only want to enable them when you want a profile, then turn it off. Otherwise it adds extra load since it continues to collect samples unlike the other profiling options.
@Reckless-Satoshi thanks for that last trace! That was exactly what we needed to get to the bottom of this. I think we have the root cause, and are discussing the best route to resolve the issue (have a few candidates of varying diff size).
Summarized my investigation and current conclusions here: https://github.com/lightningnetwork/lnd/issues/8023.
I have a candidate fix here, it applies on top of master: https://github.com/lightningnetwork/lnd/pull/8024
Background
On the RoboSats mainnet node, on start-up, LND gets both synced to chain and to graph and connects to a few peers/channels. Eventually, it stalls on its way to connecting to all peers and channels; it might connect to only 2-30 of 140 total channels. The number of active channels then decays down to 0, effectively making the node dead. After some time, synced_to_chain goes false. This has been the systematic behaviour every time LND has been restarted for the last 12 hours.
As RoboSats mainly uses hodl invoices, in a few hours many force closures will be triggered and recovery will be painful. Hopefully we can get the node back online first :D
The only interesting thing I see in the logs is a bunch of this (when LND loses chain sync?). But I can grep the logs if you think I am missing something.
There is one of these every few milliseconds.
I did already fully reboot the machine to no avail. Also updated docker / docker-compose to latest stable.
Your environment
version of lnd: v0.16.4, official Docker image
which operating system (uname -a on *Nix): Dockerized on Linux 5.4.0-156-generic #173-Ubuntu SMP Tue Jul 11 07:25:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
version of btcd, bitcoind, or other backend: Bitcoin Core 24.0.1
Steps to reproduce
I will probably be unable to reproduce whatever might have happened to arrive at this state.
Expected behaviour
LND should start and connect to all peers/channels.