GordianLN opened this issue 2 years ago
Is sync-freelist=1? Is it compacting or not?
> Is sync-freelist=1? Is it compacting or not?
`sync-freelist` is true, and since this is a PostgreSQL backend it doesn't compact on start; instead I run compaction with `VACUUM` and `VACUUM FULL` from crontab.
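That crontab setup could be sketched roughly like this (the database name `lnd`, the `postgres` user, and the schedule are all assumptions; note that `VACUUM FULL` rewrites tables under an exclusive lock, so it is safest while lnd is stopped):

```
# m  h  dom mon dow  command
# plain VACUUM nightly (can run concurrently with normal traffic)
15 3 * * *  psql -U postgres -d lnd -c 'VACUUM;'
# VACUUM FULL weekly; locks tables, so ideally run while lnd is down
30 4 * * 0  psql -U postgres -d lnd -c 'VACUUM FULL;'
```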
I think we can close this as a duplicate of https://github.com/lightningnetwork/lnd/issues/6187?
@openoms @krtk6160 @sandipndev
> I think we can close this as a duplicate of #6187?
I think we can... how do I know that it's indeed because of forced synchronous graph caching though? I know about other Pi4 nodes that take a handful of minutes to start, or at least that's what they told me
> how do I know that it's indeed because of forced synchronous graph caching though? I know about other Pi4 nodes that take a handful of minutes to start, or at least that's what they told me
A log line is printed that shows how long it took to initialize the cache: https://github.com/lightningnetwork/lnd/blob/262591c3331c2170697fe89b8c49ab857db0adf7/channeldb/graph.go#L243-L244
> I know about other Pi4 nodes that take a handful of minutes to start, or at least that's what they told me
For rasp-pi's this might happen if they have a large database: the free list needs to be reconstructed on start up. This means some initial I/O and CPU burst. The CPU of a pi isn't exactly beefy so this can take some time especially if a lot of other stuff is running on the instance. In that case, 0.15 helps a lot as it keeps less state around, and the upcoming optional migration will cut down a lot of useless data which'll speed up all DB operations.
No luck.
```
zgrep populating .lnd/logs/bitcoin/mainnet/lnd.log.*.gz
```

yielded no results, while

```
zgrep Finished .lnd/logs/bitcoin/mainnet/lnd.log.*.gz
```

only returned

```
.lnd/logs/bitcoin/mainnet/lnd.log.443.gz:2022-07-02 XX:52:35.142 [INF] HSWC: Finished cleaning: num_closed_channel=13, num_circuits=0, num_keystone=0
.lnd/logs/bitcoin/mainnet/lnd.log.443.gz:2022-07-02 XX:52:41.687 [INF] LNWL: Finished rescan for 129 addresses (synced to block 00000000000000000002e5fdd7113e91148372d5e1e5cf3b5c21772c75c97f86, height 743328)
```
The log only prints on start up, but if you don't have the `CHDB` log sub-system set to debug, then it won't show up. So you might need to restart w/ that logging level to be able to see this log.
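In `lnd.conf` that could look like the following sketch (the `debuglevel` option accepts comma-separated per-subsystem overrides; the rest of the file is omitted):

```
[Application Options]
; default level for all subsystems, with channeldb (CHDB) bumped to
; debug so the graph-cache population timing line gets printed
debuglevel=info,CHDB=debug
```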
@Roasbeef
Doesn't look like what you said is the root of the problem:

```
Finished populating in-memory channel graph (took 21.812692264s, num_node_features=13655, num_nodes=13589, num_channels=145654)
```

That's about the same amount of time it took to open the database. So something else must be causing the huge delay until lnd is actually operative.
Ran into a similar issue in one of the itest builds:

```
2023-02-28 08:00:44.389 [INF] BTCN: Server listening on 127.0.0.1:7563
...
2023-02-28 08:00:44.877 [DBG] RPCS: [ListPayments]
2023-02-28 08:00:48.100 [DBG] RPCS: [walletbalance] Total balance=100.46129986 BTC (confirmed=100.46129986 BTC, unconfirmed=0 BTC)
// 27 seconds passed, which is quite long for a local node.
2023-02-28 08:01:05.104 [INF] INVC: New invoice subscription client: id=1
2023-02-28 08:01:05.607 [DBG] INVC: Invoice(pay_hash=d5a4938caeb65af73b44a4db30243879e029538f4832c0cb6bf30cbd69b09958): added with terms amt=123000 mSAT, expiry=24h0m0s, final_cltv_delta=40
...
2023-02-28 08:01:07.211 [ERR] RPCS: WS: error writing message to upstream http server: io: read/write on closed pipe
...
```
A slow startup caused the test to fail, need more investigation.
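One way to narrow a stall like this down is to scan the log for the largest jumps between consecutive timestamps. A minimal sketch, not part of lnd, assuming the default lnd log timestamp format shown above:

```python
import re
from datetime import datetime

# Matches the leading "2023-02-28 08:00:44.389" timestamp of an lnd log line.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")

def largest_gaps(lines, top=3):
    """Return (gap_in_seconds, line) pairs for the biggest jumps
    between consecutive timestamped lines, largest first."""
    prev = None
    gaps = []
    for line in lines:
        m = TS.match(line)
        if not m:
            continue  # skip continuation lines without a timestamp
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if prev is not None:
            gaps.append(((ts - prev).total_seconds(), line.rstrip()))
        prev = ts
    return sorted(gaps, reverse=True)[:top]
```

Feeding it the excerpt above flags the 17 s jump between the `walletbalance` and `INVC` lines, which is where to start digging.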
Since this is still open, let me add a comment. Since 0.16.3 I am experiencing the same issue: the node remains in `[ERR] RPCS: [/lnrpc.Lightning/GetInfo]: waiting to start, RPC services not available` for 30+ minutes, on a really performant VPS, using Postgres.
Sometimes it doesn't even start at all, after being in that state for hours.
Any ideas how to speed up the boot? (Now using 0.17.0; the issue still persists.)
@MichaelAntonFischer what hardware is this on? We recently observed a case where a Raspberry Pi simply didn't have a good enough power supply and dips in the voltage caused the SSD to just lock up Postgres operations for minutes. Looking at the description of the original poster, this could very well be the case for them here as well.
It's a VPS with 16 cores and 32 GB RAM, so that cannot be the issue.
Okay, yeah, doesn't sound like it. Can you check what Postgres is doing when it's so slow? There are a couple of ideas for debugging here: https://github.com/lightningnetwork/lnd/issues/8009
Also, pulling a goroutine dump might be helpful, especially when using `/debug/pprof/goroutine?debug=2` instead of `?debug=1`.
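That endpoint is only reachable if lnd's profiling server is enabled; a minimal `lnd.conf` sketch (the port 9736 is just an example):

```
[Application Options]
; serve Go pprof data (including goroutine dumps) on this port
profile=9736
```

With that in place, `curl "http://localhost:9736/debug/pprof/goroutine?debug=2" > goroutine.dump` captures the full stacks.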
Thanks for the tips. Before I go into the extensive testing... I saw that there were recommended settings when using Postgres. Since the issue only appeared on newer lnd versions, what are the current recommended config values when using Postgres?
I think using multiple connections (e.g. 20 or more, make sure that Postgres also has at least that many configured) and setting a timeout (to something like `1m` or `5m`) is probably a good idea. Other than that, I don't think there are any other recommended settings.
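Expressed as an `lnd.conf` sketch (the DSN is a placeholder; the option names follow lnd's Postgres documentation):

```
[db]
db.backend=postgres

[postgres]
; placeholder DSN; point it at your actual database
db.postgres.dsn=postgres://user:password@localhost:5432/lnd?sslmode=disable
; cap individual DB calls instead of letting them hang forever
db.postgres.timeout=1m
; match or stay below Postgres' own max_connections
db.postgres.maxconnections=20
```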
BTW, https://github.com/lightningnetwork/lnd/pull/7992, which is planned to land in v0.17.1-beta, should definitely improve things for Postgres quite a bit.
linking discussion #8176 as it seems somewhat related and provides go routine dumps
Background
Starting, or restarting, my node implies a wait time of about 20 minutes, of which 20s are spent opening the postgresql database. Then a long series of

```
[ERR] RPCS: [/lnrpc.Lightning/GetInfo]: the RPC server is in the process of starting up, but not yet ready to accept calls
```

is produced (for GetInfo, listpeers, or any other call that my running scripts send to lnd to fetch data) for about the next 10 minutes, and then another long series of

```
[ERR] RPCS: [/lnrpc.Lightning/ConnectPeer]: server is still in the process of starting
```

is spat out for roughly another 10 minutes, until channels actually start to be reconnected. As mentioned before, I do have lndg jobs running every few seconds and my scripts checking whether lnd is alive with getinfo, but disabling them doesn't really appear to speed anything up.
Your environment
- `lnd` version: 0.14.3
- `uname -a` (on *Nix): raspberry bullseye
- `btcd`, `bitcoind`, or other backend version: 0.22

Steps to reproduce
Restart lnd on raspberry pi 4 8gb
Expected behaviour
Startup in a sane time (up to a couple of minutes after the database is opened, maybe?)
Actual behaviour
lnd takes 20 minutes after database open to actually be operational