lightningnetwork / lnd

Lightning Network Daemon ⚡️

lnd should exit with a non-success exit code when automatically shut down due to bitcoind problems #5625

Open Talkless opened 2 years ago

Talkless commented 2 years ago

Background

While troubleshooting tor & bitcoind issues, I restarted bitcoind twice in a row and discovered (purely by accident, as I had the bitcoind and lnd logs tailed in the same tmux split screen) that lnd was shutting itself down:

Aug 12 12:55:25 odroid-hc1 lnd[5606]: 2021-08-12 12:55:25.120 [ERR] NTFN: Unable to fetch block header: Post "http://127.0.0.1:8332": dial tcp 127.0.0.1:8332: connect: connection refused
Aug 12 12:55:25 odroid-hc1 lnd[5606]: 2021-08-12 12:55:25.304 [INF] CRTR: Pruning channel graph using block 0000000000000000000c7620d7807edfad1c475f204f5c17da4601e2a6e13945 (height=695396)
Aug 12 12:55:37 odroid-hc1 lnd[5606]: 2021-08-12 12:55:37.500 [INF] DISC: GossipSyncer(02ad6fb8d693dc1e4569bcedefadf5f72a931ae027dc0f0c544b34c1c6f3b9a02b): applying gossipFilter(start=0001-01-01 00:00:00 +0000 UTC, end=0001-01-01 00:00:00 +0000 UTC)
Aug 12 12:55:37 odroid-hc1 lnd[5606]: 2021-08-12 12:55:37.501 [INF] DISC: GossipSyncer(036d2ac71176151db04fdac839a0ddea9f3a584f6c23bb0b4ac72c323124ec506b): applying gossipFilter(start=2021-08-12 12:55:37.501421413 +0300 EEST m=+1873248.185437149, end=2157-09-18 19:23:52.501421413 +0300 EEST)
Aug 12 12:56:21 odroid-hc1 lnd[5606]: 2021-08-12 12:56:21.460 [INF] HLCK: Health check: chain backend, call: 2 failed with: -28: Verifying blocks..., backing off for: 2m0s
Aug 12 12:56:37 odroid-hc1 lnd[5606]: 2021-08-12 12:56:37.501 [INF] DISC: Broadcasting 49 new announcements in 5 sub batches
Aug 12 12:58:21 odroid-hc1 lnd[5606]: 2021-08-12 12:58:21.476 [CRT] SRVR: Health check: chain backend failed after 3 calls
Aug 12 12:58:21 odroid-hc1 lnd[5606]: 2021-08-12 12:58:21.476 [INF] SRVR: Sending request for shutdown
Aug 12 12:58:21 odroid-hc1 lnd[5606]: 2021-08-12 12:58:21.516 [INF] LTND: Received shutdown request.
Aug 12 12:58:21 odroid-hc1 lnd[5606]: 2021-08-12 12:58:21.517 [INF] LTND: Shutting down...
Aug 12 12:58:21 odroid-hc1 lnd[5606]: 2021-08-12 12:58:21.517 [INF] LTND: Gracefully shutting down.
...
Aug 12 12:59:58 odroid-hc1 lnd[5606]: 2021-08-12 12:59:58.385 [INF] LTND: Shutdown complete
Aug 12 12:59:58 odroid-hc1 systemd[1]: lnd.service: Succeeded.

systemctl status now is:

   Loaded: loaded (/etc/systemd/system/lnd.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Thu 2021-08-12 12:59:58 EEST; 9min ago
  Process: 5587 ExecStartPre=/usr/local/bin/lndcache.sh (code=exited, status=0/SUCCESS)
  Process: 5606 ExecStart=/home/lnd/lnd.go/bin/lnd --profile=9000 (code=exited, status=0/SUCCESS)
 Main PID: 5606 (code=exited, status=0/SUCCESS)

So it has SUCCESS status, meaning systemd will not restart lnd after this bail-out even though I have Restart=on-failure, because the exit was not reported as a failure. This risks "silently" losing lightning functionality...

Your environment

Steps to reproduce

Keep restarting bitcoind until lnd shuts itself down with a success result.

Expected behaviour

The lnd process should exit with a non-zero status.

Actual behaviour

lnd exits with a success status (0).
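
To illustrate the expected behaviour, here is a minimal Go sketch of a daemon that maps the reason it stopped to its process exit status. This is not lnd's actual code: `errHealthCheckFailed`, `exitCode`, and `run` are hypothetical names used only for illustration, and `run` merely simulates the "chain backend failed after 3 calls" case from the logs above.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errHealthCheckFailed is a hypothetical sentinel error marking a shutdown
// that the daemon triggered itself because a health check gave up.
var errHealthCheckFailed = errors.New("health check: chain backend failed")

// exitCode maps the reason the daemon stopped to a process exit status.
// A nil error means a clean, operator-requested shutdown.
func exitCode(err error) int {
	if err == nil {
		return 0
	}
	return 1
}

// run stands in for the daemon's main loop. Here it only simulates the
// self-initiated shutdown seen in the logs (assumption, not lnd code).
func run() error {
	return fmt.Errorf("shutting down: %w (after %d calls)", errHealthCheckFailed, 3)
}

func main() {
	err := run()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	// A non-zero status lets a supervisor treat this exit as a failure.
	os.Exit(exitCode(err))
}
```

With an exit status like this, systemd would count the shutdown as a failure, so Restart=on-failure would bring the process back up.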

DarthBenro008 commented 2 years ago

Hey! I would like to work on this!

Talkless commented 2 years ago

Got another lnd auto-shutdown. This time I did nothing, i.e. bitcoind was still running:

Sep 20 13:34:12 odroid-hc1 lnd[3214]: 2021-09-20 13:34:12.812 [INF] HLCK: Health check: chain backend, call: 2 failed with: health check: chain backend timed out after: 30s, backing off for: 2m0s
...
Sep 20 13:36:42 odroid-hc1 lnd[3214]: 2021-09-20 13:36:42.859 [CRT] SRVR: Health check: chain backend failed after 3 calls

sangaman commented 2 years ago

The same thing happened to me recently: lnd shut down because the chain backend health check failed after 3 calls. I'm also using systemd to manage lnd, and my node was down for some time without my knowledge. My preference would be for lnd not to shut down when the backend is lagging, but to go into an idle state and wait for the backend to come back online. However, having it shut down with an error on health check failure, combined with Restart=on-failure, would be good enough for my needs. So +1 to this feature request.

Talkless commented 2 years ago

This issue recurs about 2-3 times per month: [image]

Talkless commented 2 years ago

Oh, I see I can configure health checks: https://github.com/lightningnetwork/lnd/blob/ad78ff114fd38dd392989849a50d4a000f1519d0/sample-lnd.conf#L940

Still, this issue stands: the process should exit with a non-zero code.

Talkless commented 1 year ago

Any progress? lnd "died" while I was away from home, probably because bitcoind was overloaded by the huge mempool we have had recently:

May 09 23:12:03 odroid-hc1 lnd[2542]: 2023-05-09 23:12:03.347 [CRT] SRVR: Health check: chain backend failed after 10 calls
May 09 23:12:03 odroid-hc1 lnd[2542]: 2023-05-09 23:12:03.356 [INF] SRVR: Sending request for shutdown