
Running `getinfo` with postgresql db is very slow #6702

Open sandipndev opened 2 years ago

sandipndev commented 2 years ago

Background

We have set up LND with a Postgres database that has a dedicated CPU and about 4GB of RAM requested and available for itself. The database is populated and it does respond, but the responses are extremely slow. In our cluster we also run lndmon to export metrics, but it clearly has far too many restarts on Kubernetes because its requests keep failing and timing out:

NAME                           READY   STATUS    RESTARTS   AGE
bitcoind-0                     2/2     Running   0          2d21h
lnd1-0                         3/3     Running   7          2d22h
lnd1-lndmon-78997ccdf4-7plz9   1/1     Running   746        2d21h

We had opened a channel to this LND instance, and because of this erratic behaviour of not responding reliably, the channel ended up being force-closed as well.

Logs from LND:

$ kubectl logs lnd1-0 -c lnd | grep ERR | tail
2022-07-04 06:03:34.203 [ERR] DISC: unable to validate channel update short_chan_id=759840600204771329: failed to connect to `host=10.36.0.12 user=lnd_user database=lnd_db`: server error (FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300))
2022-07-04 06:12:06.942 [ERR] DISC: unable to validate channel update short_chan_id=797497774017609729: failed to connect to `host=10.36.0.12 user=lnd_user database=lnd_db`: server error (FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300))
2022-07-04 06:12:40.412 [ERR] RPCS: [/routerrpc.Router/SubscribeHtlcEvents]: context canceled
2022-07-04 06:20:44.384 [ERR] DISC: unable to validate channel update short_chan_id=779182109372448769: failed to connect to `host=10.36.0.12 user=lnd_user database=lnd_db`: server error (FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300))
2022-07-04 06:22:35.041 [ERR] RPCS: [/routerrpc.Router/SubscribeHtlcEvents]: context canceled
2022-07-04 06:27:20.468 [ERR] RPCS: [/routerrpc.Router/SubscribeHtlcEvents]: context canceled
2022-07-04 06:47:02.180 [ERR] DISC: unable to validate channel update short_chan_id=763321653983117313: failed to connect to `host=10.36.0.12 user=lnd_user database=lnd_db`: server error (FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300))
2022-07-04 07:01:22.169 [ERR] DISC: unable to validate channel update short_chan_id=799154738011242497: failed to connect to `host=10.36.0.12 user=lnd_user database=lnd_db`: server error (FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300))
2022-07-04 07:07:03.159 [ERR] DISC: unable to validate channel update short_chan_id=814529209106300929: failed to connect to `host=10.36.0.12 user=lnd_user database=lnd_db`: server error (FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300))
2022-07-04 07:09:51.814 [ERR] DISC: unable to validate channel update short_chan_id=758370553249398784: failed to connect to `host=10.36.0.12 user=lnd_user database=lnd_db`: server error (FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300))

Logs from lndmon:

$ k logs lnd1-lndmon-78997ccdf4-7plz9
2022-07-04 06:28:32.725 [INF] LNDMON: Starting Prometheus exporter...
2022-07-04 06:28:32.726 [INF] HTLC: Starting Htlc Monitor
2022-07-04 06:28:32.727 [INF] LNDMON: Prometheus active!
rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/07/04 07:01:18 http: superfluous response.WriteHeader call from github.com/lightninglabs/lndmon/health.(*Checker).ServeHTTP (health.go:76)
rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/07/04 07:01:18 http: superfluous response.WriteHeader call from github.com/lightninglabs/lndmon/health.(*Checker).ServeHTTP (health.go:76)
rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/07/04 07:09:48 http: superfluous response.WriteHeader call from github.com/lightninglabs/lndmon/health.(*Checker).ServeHTTP (health.go:76)
rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/07/04 07:09:48 http: superfluous response.WriteHeader call from github.com/lightninglabs/lndmon/health.(*Checker).ServeHTTP (health.go:76)
rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/07/04 07:25:18 http: superfluous response.WriteHeader call from github.com/lightninglabs/lndmon/health.(*Checker).ServeHTTP (health.go:76)
rpc error: code = DeadlineExceeded desc = context deadline exceeded
2022/07/04 07:25:18 http: superfluous response.WriteHeader call from github.com/lightninglabs/lndmon/health.(*Checker).ServeHTTP (health.go:76)

Environment

db.backend=postgres
db.postgres.dsn=<sensitive>
db.postgres.timeout=0
db.postgres.maxconnections=21

Expected behaviour

LND should remain responsive at all times.

Actual behaviour

I believe there is some caching involved in the RPC, which is why after one getinfo call there is a window of about 5 minutes during which requests don't time out; after that the cycle repeats.

arshbot commented 2 years ago

Hmm, looks like it's running into issues with the number of connections allowed. @bhandras is 21 connections enough?

positiveblue commented 2 years ago

@sandipndev I have some questions that may help gather extra info for resolving the issue:

As @arshbot pointed out, you could raise the number of connections on the server instance to ~100 or so and see if the problem persists. It would also be useful to check the number of active connections from time to time, in case lnd (or another service) is leaking connections until the server gets saturated.
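
For reference, a minimal sketch of such a periodic check, assuming direct psql access (host, user and database names are copied from the error logs above; substitute your own credentials):

# Count open connections per database/user/state; run this every now and then
# while the problem is reproducing to see whether the pool keeps growing.
psql -h 10.36.0.12 -U lnd_user -d lnd_db -c \
  "SELECT datname, usename, state, count(*)
     FROM pg_stat_activity
    GROUP BY datname, usename, state
    ORDER BY count(*) DESC;"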

bhandras commented 2 years ago

@sandipndev LND will use at most the configured number of connections, and any new DB operations requiring a connection will wait. They may still time out occasionally, but it seems @positiveblue is right and your DB has connections from other services taking up the configured slots.
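
A minimal sketch of checking and adjusting both sides of that limit, assuming direct psql access (the lnd flag comes from the Environment section above; the value is illustrative):

# lnd side (lnd.conf): cap lnd's own pool, e.g.
#   db.postgres.maxconnections=50

# Postgres side: the sum of all clients' pools must stay below the server limit.
psql -h 10.36.0.12 -U lnd_user -d lnd_db -c "SHOW max_connections;"
# Note: on managed Postgres (this looks like Cloud SQL) max_connections is raised
# via the provider's database flags rather than ALTER SYSTEM.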

Roasbeef commented 2 years ago

A profile would be useful here to figure out where it's spending its time: https://github.com/lightningnetwork/lnd/issues/6708#issuecomment-1176816954.

This linked issue also shows GetInfo spending a lot of time.
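
As a sketch, both a CPU profile and a full goroutine dump can be pulled from the profiling port, assuming lnd was started with the profile server enabled on port 9736 (the standard net/http/pprof endpoints):

# 30-second CPU profile (shows where time is actually spent):
curl -o cpu.prof "http://localhost:9736/debug/pprof/profile?seconds=30"

# Goroutine dump with full stack traces (debug=2 keeps the symbol names):
curl -o goroutines.txt "http://localhost:9736/debug/pprof/goroutine?debug=2"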

sandipndev commented 2 years ago

@Roasbeef We generated a profile while the issue was happening with:

curl http://localhost:9736/debug/pprof/goroutine?debug=1 > save

But we couldn't seem to find any labels on the nodes/edges of the graph. I am posting the raw file here: lnd-profile.txt

[attached image: profile001]

sandipndev commented 2 years ago

@positiveblue here's pg_stat_activity:

postgres=> SELECT * FROM pg_stat_activity;
 datid |    datname    | pid  | leader_pid | usesysid |          usename          | application_name |  client_addr  | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type |     wait_event      |        state        | backend_xid | backend_xmin |      query_id       |                                                                   query                                                                    |         backend_type         
-------+---------------+------+------------+----------+---------------------------+------------------+---------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+---------------------+---------------------+-------------+--------------+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------
       |               |   31 |            |          |                           |                  |               |                 |             | 2022-07-08 11:57:46.874498+00 |                               |                               |                               | Activity        | AutoVacuumMain      |                     |             |              |                     |                                                                                                                                            | autovacuum launcher
       |               |   33 |            |       10 | cloudsqladmin             |                  |               |                 |             | 2022-07-08 11:57:46.87518+00  |                               |                               |                               | Activity        | LogicalLauncherMain |                     |             |              |                     |                                                                                                                                            | logical replication launcher
 16467 | lnd_db        | 4875 |            |    16466 | lnd_user                  |                  | 192.168.3.203 |                 |       40938 | 2022-07-08 12:59:21.412126+00 | 2022-07-08 12:59:47.369432+00 | 2022-07-08 13:00:13.080576+00 | 2022-07-08 13:00:13.080591+00 | Client          | ClientRead          | idle in transaction |      765886 |       765886 |                     | SELECT key, value FROM channeldb_kv WHERE parent_id=22 AND key>$1 ORDER BY key LIMIT 1                                                     | client backend
 16384 | cloudsqladmin |  149 |            |       10 | cloudsqladmin             | cloudsqlagent    | 127.0.0.1     |                 |       47728 | 2022-07-08 11:57:59.580033+00 |                               | 2022-07-08 13:00:13.014436+00 | 2022-07-08 13:00:13.01448+00  | Client          | ClientRead          | idle                |             |              |                     | ;                                                                                                                                          | client backend
 14355 | postgres      | 4971 |            |    16465 | galoy-hack-lnd-2-pg-admin | psql             | 10.0.0.5      |                 |       53626 | 2022-07-08 12:59:51.501542+00 | 2022-07-08 13:00:13.078224+00 | 2022-07-08 13:00:13.078224+00 | 2022-07-08 13:00:13.078226+00 |                 |                     | active              |             |       765886 |  941978042436931562 | SELECT * FROM pg_stat_activity;                                                                                                            | client backend
 16467 | lnd_db        | 4981 |            |    16466 | lnd_user                  |                  | 192.168.3.203 |                 |       42032 | 2022-07-08 13:00:05.25491+00  |                               | 2022-07-08 13:00:05.270694+00 | 2022-07-08 13:00:05.270723+00 | Client          | ClientRead          | idle                |             |              | 8085340880788646241 | commit                                                                                                                                     | client backend
 16467 | lnd_db        | 4917 |            |    16466 | lnd_user                  |                  | 192.168.3.203 |                 |       41226 | 2022-07-08 12:59:33.277209+00 |                               | 2022-07-08 13:00:05.254759+00 | 2022-07-08 13:00:05.254828+00 | Client          | ClientRead          | idle                |             |              | 8085340880788646241 | commit                                                                                                                                     | client backend
 16384 | cloudsqladmin | 4972 |            |       10 | cloudsqladmin             | cloudsqlagent    | 127.0.0.1     |                 |       50424 | 2022-07-08 13:00:02.414083+00 |                               | 2022-07-08 13:00:02.504293+00 | 2022-07-08 13:00:02.50466+00  | Client          | ClientRead          | idle                |             |              | 3659322760672273111 | SELECT archived_count, last_archived_time, failed_count, last_failed_time, stats_reset, current_timestamp FROM pg_catalog.pg_stat_archiver | client backend
       |               |   29 |            |          |                           |                  |               |                 |             | 2022-07-08 11:57:46.873886+00 |                               |                               |                               | Activity        | BgWriterHibernate   |                     |             |              |                     |                                                                                                                                            | background writer
       |               |   28 |            |          |                           |                  |               |                 |             | 2022-07-08 11:57:46.873586+00 |                               |                               |                               | Activity        | CheckpointerMain    |                     |             |              |                     |                                                                                                                                            | checkpointer
       |               |   30 |            |          |                           |                  |               |                 |             | 2022-07-08 11:57:46.8742+00   |                               |                               |                               | Activity        | WalWriterMain       |                     |             |              |                     |                                                                                                                                            | walwriter
(11 rows)

@bhandras It's very strange to me: lnd_db (the database configured in lnd.conf) is being used by just 3 connections (while maxconnections is set to 21), and the slow getinfo responses still persist. For this test we thought it would be easier to benchmark with a bigger DB instance, so we increased it to a dedicated 2 vCPU / 8GB RAM database instance. The problem still persisted, which suggests it is not related to the database instance itself.

Roasbeef commented 2 years ago

@sandipndev are you building the binary in a special way? You can also try the direct-to-web variant (port forwarding can be used as well if it's a remote machine):

 go tool pprof --http=localhost:8083 http://localhost:9736/debug/pprof/profile

This will open a new page in a browser window where you should be able to get the proper SVG profile, as well as a flame graph.
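
Since the node runs in Kubernetes here, a sketch of the port-forwarding variant (pod name and ports taken from earlier in this thread):

# Forward lnd's profile port from the pod, then point pprof at the local end.
kubectl port-forward lnd1-0 9736:9736 &
go tool pprof --http=localhost:8083 http://localhost:9736/debug/pprof/profile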

Re your snapshot of the number of active connections: ideally you'd take it when you see the error in lnd, as various RPC calls or scenarios may be what's consuming the available connections. Looking at this page, error code 53300 matches the hypothesized scenario (a burst of connections is created and not enough slots are available).
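
A sketch of taking that snapshot repeatedly while the errors are occurring (interval and output file are arbitrary; credentials as above):

# Log the lnd_db connection count every 5 seconds while the 53300 errors appear.
while true; do
  printf '%s ' "$(date -u +%FT%TZ)"
  psql -h 10.36.0.12 -U lnd_user -d lnd_db -Atc \
    "SELECT count(*) FROM pg_stat_activity WHERE datname = 'lnd_db';"
  sleep 5
done | tee connection-counts.log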

Roasbeef commented 2 years ago

Possibly related to: https://github.com/lightningnetwork/lnd/pull/5595

yyforyongyu commented 1 year ago

This is also happening with the boltdb backend, with roughly 200 channels. We could write an itest that opens lots of channels and benchmarks it.
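
In the meantime, a crude external benchmark could look like the sketch below, assuming lncli is configured to reach the node under test (iteration count is arbitrary; the nanosecond timestamps assume GNU date):

# Time 20 consecutive getinfo calls; look for the periodic slow/fast cycle described above.
for i in $(seq 1 20); do
  start=$(date +%s%N)
  lncli getinfo > /dev/null
  end=$(date +%s%N)
  echo "call $i: $(( (end - start) / 1000000 )) ms"
done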