[bug]: New healthcheck checkOutboundPeers triggers downtime

Zetanova commented 1 week ago

Background

The new health check to test the min amount of btc node outbound peers is currently set to 6 https://github.com/lightningnetwork/lnd/blob/ce813276d55e07887c1889a745856ae759c5d39b/chainreg/chainregistry.go#L890C6-L890C24

With the default configuration bitcoind will connect to max 10 outbound peers. If they internet gets briefly interrupted then the bitcoind will not reconnect to 6 outbound peers fast enough, to not trigger the lnd shutdown.

This is only a assumption, a proper shutdown reason inside the log is missing. The message Health check: chain backend failed after 3 calls is insufficient. The bitcoind is running on the same host/docker fine and responsive.

Your environment

lnd v0.18.0-beta Linux b5c28fb40412 5.15.0-1050-raspi bitcoind v27.0.0

Steps to reproduce

Run lnd and bitcoind with near default configuration and disconnect the internet for 30sec-1min.

Expected behaviour

The default config values should not trigger on a short internet offline period.
The outbound and inbound count should be configurable
The reason why the chain backend failed should be printed in the log.

Actual behaviour

The lnd shutdowns itself as if it lost connection to the chain backend

guggero commented 1 week ago

The check you mention will only log a warning if there are not sufficient outbound peers. It will never cause a shutdown. So there must be something wrong with the call to the backend.

Zetanova commented 1 week ago

thanks, then I will investigate more

It happened multiple times since v0.18, before near never (0-2y)

guggero commented 1 week ago

Hmm, weird. The only thing I can imagine (that is new in 0.18) is that the call to GetPeerInfo itself fails for some reason. But I think any errors from that should be logged. What's your log level? You should see a message that starts with INF: Health check: that contains the actual error.

Zetanova commented 1 week ago

debuglevel=info,WTCL=warn

...
2024-06-15 21:39:40.807 [INF] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 30s, backing off for: 2m0s
2024-06-15 22:07:40.807 [INF] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 30s, backing off for: 2m0s
2024-06-15 22:12:40.808 [INF] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 30s, backing off for: 2m0s
2024-06-15 22:15:10.875 [INF] HLCK: Health check: chain backend, call: 2 failed with: health check: chain backend timed out after: 30s, backing off for: 2m0s
2024-06-15 22:17:40.881 [CRT] SRVR: Health check: chain backend failed after 3 calls

guggero commented 1 week ago

Okay, so the actual connection timed out. Did you update to bitcoind v0.27 at the same time as you updated to lnd v0.18.0? Maybe something in bitcoind changed in relation to being offline?

Zetanova commented 1 week ago

I updated the bitcoind only after lnd 0.18 and the problems started. The node itself is running already over 3y (started on 2x rsp3)

I monitoring system resources manually but all seams fine System has Free Memory and Low CPU

The bitcoind logs do not have any special errors/warning

Roasbeef commented 1 week ago

As mentioned above, we won't shutdown based on this message. Instead, it's a warning.

Zetanova commented 1 week ago

I tested now with a manual disconnect of 65sec outgoing peers decreased to 7 after the reconnect no special log entries

I will monitor longer and write if I find the issue.

The node itself never manage to stay longer then 120h online after the last update.

Zetanova commented 1 week ago

... It already shutdown yesterday the command docker logs lnd -f --tail 100 | grep "Health check" didn't print a nextline+cd

2024-06-26 16:41:41.012 [INF] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 30s, backing off for: 2m0s
2024-06-26 16:44:11.022 [INF] HLCK: Health check: chain backend, call: 2 failed with: health check: chain backend timed out after: 30s, backing off for: 2m0s
2024-06-26 18:17:40.994 [INF] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 30s, backing off for: 2m0s
2024-06-26 18:20:10.996 [INF] HLCK: Health check: chain backend, call: 2 failed with: health check: chain backend timed out after: 30s, backing off for: 2m0s
2024-06-26 18:22:41.014 [CRT] SRVR: Health check: chain backend failed after 3 calls

Zetanova commented 1 week ago

I found the issue, a now chatty process storagenode from storj suddenly burst in CPU usage over minutes.

thx for help

Zetanova commented 1 week ago

I investigated now a bit more, cpu limited the storj process and increased the bitcoind dbcache

What I discovered is that bitcoind uses high amount of cpu on a low block download at this period the $ time bitcoin-cli -netinfo is very lagy with probs: 1s, 11s, 29s

When the bitcoind does a tip verification the $ time bitcoin-cli -netinfo goes even over 1m with probs: 1m24s, 2m10s

@guggero Can the getPeers API call be the reason for the health check: chain backend timed out after: 30s message on the lnd?

guggero commented 1 week ago

Can the getPeers API call be the reason for the health check: chain backend timed out after: 30s message on the lnd?

That's what I suspect too. But if that call is slow on bitcoind, maybe it's a sign that you shouldn't cram even more things onto a piece of hardware that's barely able to keep up with just Bitcoin stuff? Maybe StorJ is completely hogging the disk as well, which is slowing down block verification.

Zetanova commented 1 week ago

I found the real issue, its the fragmentation level of the hdd seq read/write speed on the idle rsp4 are at ~95MB/s

bitcoind made 10MB/s read, but this was already the max in random access.

@guggero I know this is limited hardware, but is in the req. specs.

lightningnetwork / lnd