lightningnetwork / lnd

Lightning Network Daemon ⚡️
MIT License
7.64k stars 2.07k forks source link

LND shuts down if HTTP request to BTC node fails #5661

Open 3nprob opened 3 years ago

3nprob commented 3 years ago

Background

If any HTTP request to bitcoind fails during sync, the process shuts down. This is true also for intermittent errors like timeouts, networks errors, or bitcoind still syncing (related to #1533). Even under good conditions, when connecting to a bitcoin node over Tor, these kind of errors are normal.

I didn't find a clean way to address this in the LND codebase itself. I attempted to address this with this PR in the btcd/btcwallet codebase here: https://github.com/btcsuite/btcd/pull/1743

Have been running with this patch on testnet for some time and restarts have gone from multiple times per day to 0 over the past week.

Expected behaviour

LND retries failed requests with an exponential backoff

Actual behaviour

LND exits, requiring a wallet unlock on restart

guggero commented 3 years ago

Are you sure this isn't the healthcheck that is shutting down lnd? What do you see in the logs? You should be able to turn that off with healthcheck.chainbackend.attempts=0. Check lnd --help | grep health for more information.

3nprob commented 3 years ago

@guggero Yeah, I see in stderr (as opposed to stdout where other logs end up) a Connection failed (or equivalent; I don't have the logs anymore. There was no log enrichment from lnd here, just an error bubbling up all the way from the http client in the btcd module). In my speciic case tracked it down to: https://github.com/lightningnetwork/lnd/blob/93d12cd9fc34d42f37ebd573e6f39097b833aeed/lnwallet/btcwallet/blockchain.go#L156

During server.Start()

During startup (either initial or starting up after falling behind), this gets called for each block so probability of this causing a shutdown compounds with the number of blocks to process.

But looking around it seems that there's currently nothing in place to handle errors from calling any RPC functions so anything unexpected in that whole path will just bubble up and stop the process.. Or am I missing something?

FWIW lnd --help | grep health returns nothing for me - is there some build tag I need to enable that? FWIW I didn't touch the config around that from the sample at some point:

[healthcheck]
healthcheck.chainbackend.attempts=3
healthcheck.chainbackend.timeout=10s
healthcheck.chainbackend.backoff=30s
healthcheck.chainbackend.interval=2m

I actually missed this healthcheck-backoff thing - correct me if I'm wrong here but it looks to me that this is a separate go routine that checks the chain backend, killing the process if it fails, but it has nothing to do with handling errors during RPC calls?

Roasbeef commented 3 years ago

If any HTTP request to bitcoind fails during sync, the process shuts down

Are requests failing due to timeouts, or the network itself being unreliable? Typically we see users use a sort of hyper visor to automatically restart lnd in the background if there's a networking issue. Even if you add extra retries at the rpcclient level in btcsuite, you'd eventually need to bail out and rely on a restart at a higher level, right?

3nprob commented 2 years ago

Experiencing this now on a node requiring a wallet rescan. At some point on startup, one of the requests will fail:

[ERR] LNWL: Unable to complete chain rescan: Post "http://127.0.0.1:8330": read tcp 127.0.0.1:51674->127.0.0.1:8330: read: connection reset by peer

This effectively ends up in a restart loop (with wallet unlock required on each) and the node is unable to start up. Adding retry behavior would allow it to complete, as the errors are transient.

If I understand the dependency resolution right and it doesn't get dropped, this should be resolved when #6285 is merged, since it brings in https://github.com/btcsuite/btcd/pull/1743