lightningnetwork / lnd


Very slow start of LND even with Postgresql backend #6646

Open · GordianLN opened this issue 2 years ago

GordianLN commented 2 years ago

Background

Starting or restarting my node implies a wait time of about 20 minutes. Of that, roughly 20 seconds are spent opening the PostgreSQL database; then a long series of [ERR] RPCS: [/lnrpc.Lightning/GetInfo]: the RPC server is in the process of starting up, but not yet ready to accept calls messages is produced (for GetInfo, listpeers, or any other call my running scripts send to lnd to fetch data) for about the next 10 minutes, and then another long series of [ERR] RPCS: [/lnrpc.Lightning/ConnectPeer]: server is still in the process of starting is spat out for roughly another 10 minutes, until channels actually start to be reconnected.

As mentioned before, I do have lndg jobs running every few seconds and scripts checking whether lnd is alive with getinfo, but disabling them doesn't appear to speed anything up.
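(For reference, a minimal way to measure how long the RPC server takes to come up is to poll it until a call succeeds. The loop below is only a sketch; it assumes lncli is on the PATH and already configured for this node.)

start=$(date +%s)
# poll until the RPC server accepts calls, then report how long it took
until lncli getinfo >/dev/null 2>&1; do
    sleep 5
done
echo "lnd RPC ready after $(( $(date +%s) - start )) seconds"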

Your environment

Steps to reproduce

Restart lnd on a Raspberry Pi 4 (8 GB).

Expected behaviour

Startup in a sane time (up to a couple of minutes after the database is opened, maybe?)

Actual behaviour

lnd takes 20 minutes after the database is opened to actually become operational.

lucasdcf commented 2 years ago

Is sync-freelist=1? Is it compacting or not?

GordianLN commented 2 years ago

Is sync-freelist=1? Is it compacting or not?

sync-freelist is true. And since this is a PostgreSQL backend it doesn't compact on start; instead I run compaction with VACUUM and VACUUM FULL from crontab.
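(For illustration, such periodic vacuuming could be scheduled from cron roughly as below. The database name, user and schedule are assumptions, not taken from this thread; note that VACUUM FULL takes an exclusive lock while it rewrites tables, so it should not run while lnd needs the database.)

# plain VACUUM nightly, VACUUM FULL once a week (locks tables while it runs)
0 3 * * *   psql -U lnd -d lnd -c "VACUUM (ANALYZE);"
0 4 * * 0   psql -U lnd -d lnd -c "VACUUM FULL;"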

Roasbeef commented 2 years ago

I think we can close this as a duplicate of https://github.com/lightningnetwork/lnd/issues/6187?

nicolasburtey commented 2 years ago

@openoms @krtk6160 @sandipndev

GordianLN commented 2 years ago

I think we can close this as a duplicate of #6187?

I think we can... how do I know that it's indeed because of forced synchronous graph caching though? I know about other Pi4 nodes that take a handful of minutes to start, or at least that's what they told me

Roasbeef commented 2 years ago

how do I know that it's indeed because of forced synchronous graph caching though? I know about other Pi4 nodes that take a handful of minutes to start, or at least that's what they told me

A log line is printed that shows how long it took to initialize the cache: https://github.com/lightningnetwork/lnd/blob/262591c3331c2170697fe89b8c49ab857db0adf7/channeldb/graph.go#L243-L244
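(To check how long that step took on a given node, the rotated logs can be searched for that message, e.g. with the same zgrep approach used later in this thread; the path assumes the default mainnet log location.)

zgrep "Finished populating in-memory channel graph" ~/.lnd/logs/bitcoin/mainnet/lnd.log*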

Roasbeef commented 2 years ago

I know about other Pi4 nodes that take a handful of minutes to start, or at least that's what they told me

For Raspberry Pis this might happen if they have a large database: the free list needs to be reconstructed on startup. This means an initial I/O and CPU burst. The CPU of a Pi isn't exactly beefy, so this can take some time, especially if a lot of other stuff is running on the instance. In that case, 0.15 helps a lot as it keeps less state around, and the upcoming optional migration will cut down a lot of useless data, which'll speed up all DB operations.
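(As an aside, for nodes still on the default bbolt backend, the sync-freelist option mentioned earlier in this thread persists the freelist so it does not have to be rebuilt on every start, at the cost of slower writes during normal operation. The invocation below is only a sketch of that option; it does not apply to the Postgres backend used by the reporter.)

# bbolt-backed nodes only: persist the freelist to skip rebuilding it at startup
lnd --sync-freelist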

GordianLN commented 2 years ago

No luck.

zgrep populating .lnd/logs/bitcoin/mainnet/lnd.log.*.gz yielded no results, while zgrep Finished .lnd/logs/bitcoin/mainnet/lnd.log.*.gz only returned

.lnd/logs/bitcoin/mainnet/lnd.log.443.gz:2022-07-02 XX:52:35.142 [INF] HSWC: Finished cleaning: num_closed_channel=13, num_circuits=0, num_keystone=0
.lnd/logs/bitcoin/mainnet/lnd.log.443.gz:2022-07-02 XX:52:41.687 [INF] LNWL: Finished rescan for 129 addresses (synced to block 00000000000000000002e5fdd7113e91148372d5e1e5cf3b5c21772c75c97f86, height 743328)

Roasbeef commented 2 years ago

The log only prints on startup, but if you don't have the CHDB log sub-system set to debug, then it won't show up. So you might need to restart with that logging level to be able to see this log.
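(For anyone following along, the sub-system log level can be raised either at startup or at runtime; the commands below are a sketch based on lnd's debuglevel facilities, with the CHDB sub-system name taken from the comment above.)

# at startup: raise only the channeldb sub-system to debug
lnd --debuglevel=CHDB=debug

# the runtime counterpart (won't reveal a line that was already skipped at startup)
lncli debuglevel --level CHDB=debug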

GordianLN commented 2 years ago

@Roasbeef Doesn't look like what you said is the root of the problem: Finished populating in-memory channel graph (took 21.812692264s, num_node_features=13655, num_nodes=13589, num_channels=145654), which is about the same amount of time it took to open the database. So something else must be causing the huge delay until lnd is actually operational.

yyforyongyu commented 1 year ago

Ran into a similar issue in one of the itest builds:

2023-02-28 08:00:44.389 [INF] BTCN: Server listening on 127.0.0.1:7563
...
2023-02-28 08:00:44.877 [DBG] RPCS: [ListPayments]
2023-02-28 08:00:48.100 [DBG] RPCS: [walletbalance] Total balance=100.46129986 BTC (confirmed=100.46129986 BTC, unconfirmed=0 BTC)

// 27 seconds passed, which is quite long for a local node.

2023-02-28 08:01:05.104 [INF] INVC: New invoice subscription client: id=1
2023-02-28 08:01:05.607 [DBG] INVC: Invoice(pay_hash=d5a4938caeb65af73b44a4db30243879e029538f4832c0cb6bf30cbd69b09958): added with terms amt=123000 mSAT, expiry=24h0m0s, final_cltv_delta=40
...
2023-02-28 08:01:07.211 [ERR] RPCS: WS: error writing message to upstream http server: io: read/write on closed pipe
...

A slow startup caused the test to fail; this needs more investigation.

MichaelAntonFischer commented 1 year ago

Since this is still open, let me add a comment. I have been experiencing the same issue since 0.16.3. The node remains in [ERR] RPCS: [/lnrpc.Lightning/GetInfo]: waiting to start, RPC services not available for 30+ minutes, on a really performant VPS, using Postgres.

Sometimes it doesn't even start at all, after being in that state for hours.

Any ideas how to speed up the boot? (Now using 0.17.0, the issue still persists.)

guggero commented 1 year ago

@MichaelAntonFischer what hardware is this on? We recently observed a case where a Raspberry Pi simply didn't have a good enough power supply and dips in the voltage caused the SSD to just lock up Postgres operations for minutes. Looking at the description of the original poster, this could very well be the case for them here as well.

MichaelAntonFischer commented 1 year ago

It's a VPS with 16 cores and 32 GB RAM, so that cannot be the issue.

guggero commented 1 year ago

Okay, yeah, doesn't sound like it. Can you check what Postgres is doing when it's so slow? There are a couple of ideas for debugging here: https://github.com/lightningnetwork/lnd/issues/8009 Also, pulling a goroutine dump might be helpful, especially when using /debug/pprof/goroutine?debug=2 instead of ?debug=1.
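(As a concrete illustration of that suggestion, the goroutine dump can be fetched over lnd's profiling listener. This assumes lnd was started with the --profile option; the port is just an example.)

# start lnd with a pprof listener, e.g. --profile=9736, then dump all goroutines with stacks
curl -s "http://localhost:9736/debug/pprof/goroutine?debug=2" > goroutines.txt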

MichaelAntonFischer commented 1 year ago

Thanks for the tips. Before I go into the extensive testing... I saw that there were recommended settings when using Postgres. Since the issue only appeared on newer lnd versions, what are the current recommended config values when using Postgres?

guggero commented 1 year ago

I think using multiple connections (e.g. 20 or more; make sure that Postgres also has at least that many configured) and setting a timeout (to something like 1m or 5m) is probably a good idea. Other than that, I don't think there are any other recommended settings. BTW, https://github.com/lightningnetwork/lnd/pull/7992, which is planned to land in v0.17.1-beta, should definitely improve things for Postgres quite a bit.
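(For reference, a setup along those lines could look like the sketch below, expressed as command-line flags; the db.postgres.* option names come from lnd's Postgres backend settings, while the DSN and the concrete values are examples only, not recommendations from this thread.)

# example only: raise the connection limit and set a generous query timeout
lnd --db.backend=postgres \
    --db.postgres.dsn="postgres://lnd:lnd@localhost:5432/lnd?sslmode=disable" \
    --db.postgres.maxconnections=25 \
    --db.postgres.timeout=5m
# make sure PostgreSQL itself allows at least as many connections (max_connections in postgresql.conf)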

blckbx commented 11 months ago

Linking discussion #8176, as it seems somewhat related and provides goroutine dumps.