sandipndev opened this issue 2 years ago
Hmm, looks like it's running into issues with the number of connections allowed. @bhandras is 21 connections enough?
@sandipndev I have some questions that may help gather extra info for resolving the issue:
SELECT * FROM pg_stat_activity;
As @arshbot pointed out, you could raise the number of connections on the server instance to ~100 or so and see if the problem persists. It would also be useful to check the number of active connections from time to time, in case lnd (or another service) is leaking them until the server gets saturated.
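As one way to keep an eye on that over time, here is a minimal Go sketch that polls pg_stat_activity; the github.com/lib/pq driver and the PG_DSN environment variable are illustrative assumptions, not anything lnd itself uses:

```go
// connwatch.go: periodically logs connections per database/state so a
// slow leak toward max_connections becomes visible over time.
package main

import (
	"database/sql"
	"log"
	"os"
	"time"

	_ "github.com/lib/pq" // assumed Postgres driver for this sketch
)

func main() {
	db, err := sql.Open("postgres", os.Getenv("PG_DSN"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	for {
		rows, err := db.Query(`SELECT datname, state, COUNT(*)
			FROM pg_stat_activity GROUP BY 1, 2`)
		if err != nil {
			log.Printf("query failed: %v", err)
		} else {
			for rows.Next() {
				var datname, state sql.NullString
				var n int
				if err := rows.Scan(&datname, &state, &n); err != nil {
					log.Printf("scan failed: %v", err)
					continue
				}
				log.Printf("db=%q state=%q connections=%d",
					datname.String, state.String, n)
			}
			rows.Close()
		}
		time.Sleep(30 * time.Second)
	}
}
```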
@sandipndev LND will use at most the configured number of connections, and any new db operations requiring a connection will wait. They may still time out occasionally, but it seems that @positiveblue is right and your db has connections from other services taking up the configured slots.
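To make that waiting/timeout behaviour concrete, here is a minimal Go sketch of a bounded database/sql pool; the pool size of 21 and the lib/pq driver are illustrative assumptions, not lnd's actual wiring:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"os"
	"time"

	_ "github.com/lib/pq" // assumed Postgres driver for this sketch
)

func main() {
	db, err := sql.Open("postgres", os.Getenv("PG_DSN"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Cap the pool the way a maxconnections=21 style setting would:
	// once 21 connections are busy, further queries block until one
	// frees up (or the context deadline hits).
	db.SetMaxOpenConns(21)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// If the pool stays saturated past the deadline, this returns a
	// context.DeadlineExceeded error instead of a result.
	var one int
	if err := db.QueryRowContext(ctx, "SELECT 1").Scan(&one); err != nil {
		log.Printf("query did not get a connection in time: %v", err)
		return
	}
	log.Printf("got result: %d", one)
}
```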
A profile would be useful here to figure out where it's spending its time: https://github.com/lightningnetwork/lnd/issues/6708#issuecomment-1176816954. That linked issue also shows GetInfo spending a lot of time.
@Roasbeef
We generated a profile while the issue was happening with:
curl http://localhost:9736/debug/pprof/goroutine?debug=1 > save
But we couldn't seem to find any labels on the nodes/edges of the graph. I am posting the raw file here: lnd-profile.txt
@positiveblue here's pg_stat_activity:
postgres=> SELECT * FROM pg_stat_activity;
datid | datname | pid | leader_pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query_id | query | backend_type
-------+---------------+------+------------+----------+---------------------------+------------------+---------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+---------------------+---------------------+-------------+--------------+---------------------+--------------------------------------------------------------------------------------------------------------------------------------------+------------------------------
| | 31 | | | | | | | | 2022-07-08 11:57:46.874498+00 | | | | Activity | AutoVacuumMain | | | | | | autovacuum launcher
| | 33 | | 10 | cloudsqladmin | | | | | 2022-07-08 11:57:46.87518+00 | | | | Activity | LogicalLauncherMain | | | | | | logical replication launcher
16467 | lnd_db | 4875 | | 16466 | lnd_user | | 192.168.3.203 | | 40938 | 2022-07-08 12:59:21.412126+00 | 2022-07-08 12:59:47.369432+00 | 2022-07-08 13:00:13.080576+00 | 2022-07-08 13:00:13.080591+00 | Client | ClientRead | idle in transaction | 765886 | 765886 | | SELECT key, value FROM channeldb_kv WHERE parent_id=22 AND key>$1 ORDER BY key LIMIT 1 | client backend
16384 | cloudsqladmin | 149 | | 10 | cloudsqladmin | cloudsqlagent | 127.0.0.1 | | 47728 | 2022-07-08 11:57:59.580033+00 | | 2022-07-08 13:00:13.014436+00 | 2022-07-08 13:00:13.01448+00 | Client | ClientRead | idle | | | | ; | client backend
14355 | postgres | 4971 | | 16465 | galoy-hack-lnd-2-pg-admin | psql | 10.0.0.5 | | 53626 | 2022-07-08 12:59:51.501542+00 | 2022-07-08 13:00:13.078224+00 | 2022-07-08 13:00:13.078224+00 | 2022-07-08 13:00:13.078226+00 | | | active | | 765886 | 941978042436931562 | SELECT * FROM pg_stat_activity; | client backend
16467 | lnd_db | 4981 | | 16466 | lnd_user | | 192.168.3.203 | | 42032 | 2022-07-08 13:00:05.25491+00 | | 2022-07-08 13:00:05.270694+00 | 2022-07-08 13:00:05.270723+00 | Client | ClientRead | idle | | | 8085340880788646241 | commit | client backend
16467 | lnd_db | 4917 | | 16466 | lnd_user | | 192.168.3.203 | | 41226 | 2022-07-08 12:59:33.277209+00 | | 2022-07-08 13:00:05.254759+00 | 2022-07-08 13:00:05.254828+00 | Client | ClientRead | idle | | | 8085340880788646241 | commit | client backend
16384 | cloudsqladmin | 4972 | | 10 | cloudsqladmin | cloudsqlagent | 127.0.0.1 | | 50424 | 2022-07-08 13:00:02.414083+00 | | 2022-07-08 13:00:02.504293+00 | 2022-07-08 13:00:02.50466+00 | Client | ClientRead | idle | | | 3659322760672273111 | SELECT archived_count, last_archived_time, failed_count, last_failed_time, stats_reset, current_timestamp FROM pg_catalog.pg_stat_archiver | client backend
| | 29 | | | | | | | | 2022-07-08 11:57:46.873886+00 | | | | Activity | BgWriterHibernate | | | | | | background writer
| | 28 | | | | | | | | 2022-07-08 11:57:46.873586+00 | | | | Activity | CheckpointerMain | | | | | | checkpointer
| | 30 | | | | | | | | 2022-07-08 11:57:46.8742+00 | | | | Activity | WalWriterMain | | | | | | walwriter
(11 rows)
@bhandras
It's super weird to me: it looks like lnd_db (the database configured in lnd.conf) is only being used by 3 connections (while maxconnections is set to 21), and yet the slow getinfo responses persist. For this test we thought it'd be easier to benchmark if the db instance was bigger, so we moved to a dedicated 2 vCPU / 8 GB RAM database instance. The problem still persisted, which suggests it is not related to the database instance itself.
@sandipndev are you building the binary in a special way? You can also try the direct-to-web variant (port forwarding can be used as well if it's a remote machine):
go tool pprof --http=localhost:8083 http://localhost:9736/debug/pprof/profile
This'll open up a new page in a browser window where you should be able to get the proper SVG profile, as well as a flame graph.
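One likely reason for the missing labels in the earlier attempt: the ?debug=1 variant of the goroutine endpoint serves plain text rather than the binary pprof format that graph tools expect. A small Go sketch (reusing the 9736 profile port from the earlier curl command; the output filename is arbitrary) that saves the binary profile for go tool pprof:

```go
// fetchprofile.go: downloads the goroutine profile in binary pprof
// format (no ?debug=1), which `go tool pprof` can render with labels.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	resp, err := http.Get("http://localhost:9736/debug/pprof/goroutine")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutine.pb.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	// Then: go tool pprof -http=localhost:8083 goroutine.pb.gz
}
```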
Re your snapshot of the number of active connections: ideally you'd run that when you see the error in lnd, as various RPC calls or scenarios may be what's leading to the available connections being consumed. Looking at this page, error code 53300 matches up w/ the hypothetical scenario (a burst of connections is created and not enough slots are available).
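For reference, 53300 is Postgres's too_many_connections SQLSTATE, and a client can check for it explicitly before retrying. A rough Go sketch, assuming the lib/pq driver (pgx surfaces the code via *pgconn.PgError instead); the retry policy is purely illustrative:

```go
package main

import (
	"database/sql"
	"errors"
	"log"
	"os"
	"time"

	"github.com/lib/pq" // assumed driver; importing it also registers "postgres"
)

// tooManyConnections reports whether err carries SQLSTATE 53300.
func tooManyConnections(err error) bool {
	var pqErr *pq.Error
	return errors.As(err, &pqErr) && pqErr.Code == "53300"
}

func main() {
	db, err := sql.Open("postgres", os.Getenv("PG_DSN"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Ping with a simple backoff when the server reports that all
	// connection slots are taken.
	for attempt := 1; attempt <= 5; attempt++ {
		err := db.Ping()
		if err == nil {
			log.Println("connected")
			return
		}
		if tooManyConnections(err) {
			log.Printf("attempt %d: server is out of connection slots, retrying", attempt)
			time.Sleep(time.Duration(attempt) * time.Second)
			continue
		}
		log.Fatalf("unexpected error: %v", err)
	}
	log.Fatal("gave up: too many connections")
}
```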
Possibly related to: https://github.com/lightningnetwork/lnd/pull/5595
This is also happening with the boltdb backend, with ~200 channels. We could write an itest that opens lots of channels and benchmarks it.
Background
We have set up LND with a Postgres database; it has a dedicated CPU and about 4 GB of RAM requested and available for itself. The database is populated, and it's not that the node isn't responding at all, but the responses are extremely slow. In our cluster we also run lndmon for metrics export, but it clearly has far too many restarts on Kubernetes because it keeps failing and timing out. We had opened a channel to this LND instance, and because of this erratic behaviour of not responding properly all the time, the channel ended up force-closed as well.
Logs from LND:
Logs from lndmon:
Environment
lnd version: 0.15.0-beta commit=v0.15.0-beta
image: lightninglabs/lnd
Expected behaviour
LND should at least be responding all the time.
Actual behaviour
I believe there is some caching involved in the RPC, which is why after one getinfo call there is a window of roughly 5 minutes during which requests don't time out; after that, the cycle repeats.
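One way to pin down that ~5 minute cycle would be to time getinfo on a fixed interval and watch for the periodicity. A rough Go sketch, assuming lncli is on PATH and already configured for this node; the 30-second interval and 1-minute timeout are arbitrary choices:

```go
// getinfotimer.go: times `lncli getinfo` every 30 seconds so that any
// periodic slowdown (e.g. the ~5 minute cycle described above) shows
// up in the log output.
package main

import (
	"context"
	"log"
	"os/exec"
	"time"
)

func main() {
	for {
		ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
		start := time.Now()
		out, err := exec.CommandContext(ctx, "lncli", "getinfo").CombinedOutput()
		cancel()

		if err != nil {
			log.Printf("getinfo failed after %v: %v (output: %s)",
				time.Since(start), err, out)
		} else {
			log.Printf("getinfo succeeded in %v", time.Since(start))
		}
		time.Sleep(30 * time.Second)
	}
}
```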