lightningnetwork / lnd

Lightning Network Daemon ⚡️
MIT License
7.65k stars 2.07k forks source link

lnd won't start with postgres after compiling with kvdb_postgres tag, sometimes starts but fails right after #5981

Closed GordianLN closed 2 years ago

GordianLN commented 2 years ago

Background

Following #5975, after compiling lnd from source with postegres enabled, I am still not able to start lnd; database tables get sometimes initialized but I cannot get a working instance out of them.

Your environment

Steps to reproduce

Compiled as per https://github.com/lightningnetwork/lnd/blob/master/docs/INSTALL.md#building-a-development-version-from-source I edited lnd/make/release_flags.mk adding kvdb_postgres to tags, and commenting out any platform except linux arm64. Sidenote: I am a newbie, I spent quite a deal of time to reach this point, having to delve into code and several searches... a little fast addition to the docs could save inexperienced people like me a great deal of suffering. I just made a pull request of it: https://github.com/lightningnetwork/lnd/pull/5982 Relevant config:

[db]
db.backend=postgres
db.postgres.dsn=postgresql://bitcoin:bitcoin@127.0.0.1:5432/bitcoin
db.postgres.timeout=20s

Expected behaviour

A node is started from scratch, I can recover from SCB, lnd works on a postgres backend.

Actual behaviour

I get this in journalctl a few times:

lnd[16541]:[INF] LTND: Opening the main database, this might take a few minutes...
lnd[16541]:[INF] LTND: Using remote postgres database! Creating graph and channel state DB instances
lnd[16541]:[ERR] LTND: unable to open graph DB: unable to create new channel graph: context

lnd[16541]:[ERR] LTND: Shutting down because error in main method: unable to open databases: unable to open graph DB: unable to create new channel graph: context canceled
lnd[16541]:[INF] LTND: Shutdown complete
lnd[16541]: unable to open databases: unable to open graph DB: unable to create new channel graph: context canceled
systemd[1]: lnd.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: lnd.service: Failed with result 'exit-code'.

Then, at the Nth iteration, without any intervention from my part, I get:

[INF] LTND: Opening the main database, this might take a few minutes...
[INF] LTND: Using remote postgres database! Creating graph and channel state DB instances
[INF] CHDB: Checking for schema update: latest_version=24, db_version=24
[INF] LTND: Database(s) now open (time_to_open=301.123353ms)!
[INF] LTND: We're not running within systemd
[INF] LTND: Waiting for wallet encryption password. Use `lncli create` to create a wallet, `lncli unlock` to unlock an existing wallet, or `lncli changepassword` to change the password of an existing wallet and unlock it.

I run lncli create to recover funds from previous installation's seed, I enter known seed, passphrase, look ahead, at the end of which I get shown the seed again, and right after, [lncli] rpc error: code = Unavailable desc = error reading from server: EOF, in journal the previous logs are followed by:

[CRT] CHDB: Caught unhandled error: context canceled
[INF] CHDB: Sending request for shutdown
[INF] LTND: Received shutdown request.
[INF] LTND: Shutting down...
[INF] LTND: Gracefully shutting down.
[ERR] LTND: Could not unload new wallet: wallet is not loaded
[ERR] LTND: unable to set up wallet password listeners: context canceled
[ERR] LTND: Shutting down because error in main method: error creating wallet config: unable to set up wallet password listeners: context canceled
[ERR] RPCS: [/lnrpc.WalletUnlocker/InitWallet]: got no unlock message before timeout
[INF] LTND: Shutdown complete
error creating wallet config: unable to set up wallet password listeners: context canceled
lnd.service: Main process exited, code=exited, status=1/FAILURE
lnd.service: Failed with result 'exit-code'.

Database tables have been initialized, as shown in here after running psql from bitcoin user cli:

psql (11.14 (Raspbian 11.14-0+deb10u1))
Type "help" for help.

bitcoin=# \dt
              List of relations
 Schema |       Name       | Type  |  Owner
--------+------------------+-------+---------
 public | channeldb_kv     | table | bitcoin
 public | decayedlogdb_kv  | table | bitcoin
 public | macaroondb_kv    | table | bitcoin
 public | towerclientdb_kv | table | bitcoin
 public | towerserverdb_kv | table | bitcoin
 public | walletdb_kv      | table | bitcoin
(6 rows)

bitcoin=# select * from walletdb_kv
bitcoin-# select * from decayedlogdb_kv
bitcoin-# select * from macaroondb_kv
bitcoin-# select * from channeldb_kv
bitcoin-#
guggero commented 2 years ago

Sounds like a weird interaction with db.postgres.timeout=20s. Can you try if setting the timeout to 0 allows you to unlock the wallet normally?

GordianLN commented 2 years ago

What do you know, it does start by setting the timeout to 0, I was able to test once again on-chain and off-chain recovery, this time on postgres. What should be the role of the timeout in this case?

From journalctl I now notice long bursts (tens of lines each time) of: [ERR] RPCS: [/lnrpc.Lightning/GetTransactions]: failed to connect tohost=127.0.0.1 user=bitcoin database=bitcoin: server error (FATAL: sorry, too many clients already (SQLSTATE 53300))

guggero commented 2 years ago

You need to increase the number of allowed connections (I think it's -N xxx, so set -N 500 as we do in our integration tests) until this is merged: https://github.com/lightningnetwork/lnd/pull/5992