3nprob opened this issue 3 years ago
Are you sure this isn't the healthcheck that is shutting down lnd? What do you see in the logs? You should be able to turn that off with `healthcheck.chainbackend.attempts=0`. Check `lnd --help | grep health` for more information.
@guggero Yeah, I see in stderr (as opposed to stdout, where the other logs end up) a `Connection failed` (or equivalent; I don't have the logs anymore. There was no log enrichment from lnd here, just an error bubbling up all the way from the HTTP client in the btcd module). In my specific case I tracked it down to https://github.com/lightningnetwork/lnd/blob/93d12cd9fc34d42f37ebd573e6f39097b833aeed/lnwallet/btcwallet/blockchain.go#L156 during `server.Start()`.

During startup (either initial, or catching up after falling behind), this gets called for each block, so the probability of this causing a shutdown compounds with the number of blocks to process.

But looking around, it seems there's currently nothing in place to handle errors from calling any RPC functions, so anything unexpected in that whole path will just bubble up and stop the process. Or am I missing something?
FWIW, `lnd --help | grep health` returns nothing for me. Is there some build tag I need to enable for that? FWIW, I haven't touched the config around this since copying it from the sample at some point:

```
[healthcheck]
healthcheck.chainbackend.attempts=3
healthcheck.chainbackend.timeout=10s
healthcheck.chainbackend.backoff=30s
healthcheck.chainbackend.interval=2m
```
I actually missed this healthcheck backoff thing. Correct me if I'm wrong here, but it looks to me like this is a separate goroutine that checks the chain backend and kills the process if the check fails, and it has nothing to do with handling errors during RPC calls?
> If any HTTP request to bitcoind fails during sync, the process shuts down
Are requests failing due to timeouts, or the network itself being unreliable? Typically we see users run some sort of hypervisor to automatically restart lnd in the background if there's a networking issue. Even if you add extra retries at the rpcclient level in btcsuite, you'd eventually need to bail out and rely on a restart at a higher level, right?
Experiencing this now on a node requiring a wallet rescan. At some point during startup, one of the requests will fail:

```
[ERR] LNWL: Unable to complete chain rescan: Post "http://127.0.0.1:8330": read tcp 127.0.0.1:51674->127.0.0.1:8330: read: connection reset by peer
```

This effectively ends up in a restart loop (with a wallet unlock required each time) and the node is unable to start up. Adding retry behavior would allow it to complete, as the errors are transient.
If I understand the dependency resolution right and it doesn't get dropped, this should be resolved when #6285 is merged, since it brings in https://github.com/btcsuite/btcd/pull/1743
### Background
If any HTTP request to bitcoind fails during sync, the process shuts down. This is also true for intermittent failures like timeouts, network errors, or bitcoind still syncing (related to #1533). Even under good conditions, these kinds of errors are normal when connecting to a bitcoin node over Tor.
I didn't find a clean way to address this in the LND codebase itself, so I attempted to address it with a PR in the btcd/btcwallet codebase: https://github.com/btcsuite/btcd/pull/1743

I have been running with this patch on testnet for some time, and restarts have gone from multiple times per day to zero over the past week.
### Expected behaviour

LND retries failed requests with exponential backoff
### Actual behaviour

LND exits, requiring a wallet unlock on restart