Suboptimal Boundary database performance #3281

Closed: macmiranda closed this issue 1 year ago

macmiranda commented 1 year ago

Describe the bug (as originally posted on HashiCorp Discuss)

Hi community,

We’ve been consistently experiencing high load on our Boundary RDS PostgreSQL database. The impact is most visible when a user opens Boundary Desktop and has to wait 7 to 10 seconds to see the list of Targets they can connect to or the list of Sessions that are active/pending. Authorizing a new session also takes a couple of seconds, whereas it used to be immediate during the implementation phase. These waiting times are less noticeable when using the Boundary CLI. (Update: when using curl to list Targets, the response time was within an acceptable range of 0.4-0.5s; listing Sessions, though, took between 8 and 9 seconds.)

The issue is somewhat similar to the one described here, supposedly fixed by enhancements in 0.8.x and 0.9.x. We’re running 0.12.2.

The DB instance is a db.t4g.xlarge with 4 vCPUs and 16GB RAM, which I would assume to be reasonably sized for a production deployment based on the reference architecture that uses a db.t3.micro.

Here you can see some stats of the top SQL commands by waits (the top query is the same one pointed out in the GitHub issue above): [screenshot: stats]

and performance metrics for the instance (notice that 1 vCPU is consistently at max utilization, i.e. ~25% of the total): [screenshot: metrics]

We have 3 controllers running in a single Kubernetes cluster. The workers run on 6 different clusters (3 per cluster, with a few exceptions). Both controllers and workers are exposed using Kubernetes Ingress (workers leverage an ingress-nginx feature to expose TCP ports).

At any given time, no more than 20 Boundary sessions are active across all clusters (since Boundary adoption is happening slowly). This was verified with the following SQL query (though I'm not sure my interpretation of the result is correct):

SELECT * FROM "public"."session_state"  WHERE ("state" = 'active') AND ("end_time" IS NULL)
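
For comparison, a hedged variant that counts distinct active sessions rather than returning raw state rows; it assumes session_state carries a session_id column, as in the Boundary 0.12.x schema:

SELECT count(DISTINCT "session_id")
  FROM "public"."session_state"
 WHERE "state" = 'active'
   AND "end_time" IS NULL;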

Although the number of sessions is quite small, the number of rows in the session_connection table is quite high (160k), and session_connection_state holds roughly three times as many (understandably, since it tracks 3 different states for each connection).
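
A quick way to sanity-check those cardinalities (the singular table names below match the index statement further down; adjust if your schema differs):

SELECT
  (SELECT count(*) FROM session_connection)       AS connection_rows,
  (SELECT count(*) FROM session_connection_state) AS connection_state_rows;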

Some questions I have:

To Reproduce

This may be hard to reproduce in a lab, as the issue may be related to the cardinality of session_connection.

Expected behavior

No sustained database load at 25%, and response times under a second when listing targets and sessions (though these may not be related to the same issue).

Additional context

After further testing with HashiCorp engineers, we noticed that the average CPU utilization of the DB instance dropped from ~25% to ~15% after the creation of the following index:

CREATE INDEX session_connection_worker_id_closed_reason_ix
    ON session_connection (worker_id, closed_reason);

[screenshot: after_index]
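
To check whether the planner actually picks up the new index, one can run an EXPLAIN over a representative filter; the worker id below is made up:

EXPLAIN ANALYZE
SELECT count(*)
  FROM session_connection
 WHERE worker_id = 'w_1234567890'
   AND closed_reason IS NULL;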

Session-listing times weren't affected by this change and remained in the range between 8 and 9 seconds (for an admin with access to all Targets).

We also noticed that, due to the type of application and protocol being proxied through Boundary, some sessions end up having a lot of "child" connections (counts per session below; see the query sketch after the table):

connection_count    session_id
70065 s_9a0SwaVQCg
48917 s_8CkGLFdRcc
26352 s_yijVTcLzay
11630 s_sEx10tttB2
5199 s_F5g0UKDvJU
2674 s_FjQEVDdIQy
2010 s_t4aym74hI0
1394 s_e7Xv80ML8o
1291 s_drJvt6vmf0
1026 s_2mTziw0ZjZ
457 s_BF3RcTzwbh
453 s_kMMBo4hvnm
192 s_dHwnZcsAL1
73 s_zDePrYWCxe
59 s_b2IzH36QX1
29 s_ejWlUr9snm
20 s_hHC9zUhIo9
18 s_bL9n4kyCXL
14 s_8GDqI1HkfR
12 s_u3kKyfKgZ1
11 s_BvDCtwmH0y
10 s_NzXmP9hljv
10 s_0R32kh9078
9 s_XwtjPNsJJr
8 s_8NUlPF6CG2
7 s_FXdW77rvXU
3 s_CDPTJ4AKJG
2 s_iOThqDP1rf
2 s_NCa0pfi7bl
2 s_guv25zIa6u
2 s_7lFUoV3vWO
1 s_USd522f6m5
1 s_qhSujJTedE
1 s_2m6TuxF8UL
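
For reference, a sketch of the aggregation that could produce a table like the one above, using the same session_connection table as the index statement:

SELECT count(*) AS connection_count, session_id
  FROM session_connection
 GROUP BY session_id
 ORDER BY connection_count DESC;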
tmessi commented 1 year ago

Reopening this for now. The index created in #3283 helps with some of the queries, but there are some others that need to be followed up on.

macmiranda commented 1 year ago

Short update: after applying the two linked PR fixes, we were able to reduce the load on the DB even further by dropping old sessions from the session table that weren't connected to any existing project or scope. In fact, we just looked for really old sessions (which should have been closed anyway because of their max TTL, but weren't) that didn't show up when running boundary sessions list -scope-id global -recursive as an admin. The Boundary database now runs on a db.t4g.small with ~10% average load, compared to ~27% on a db.t4g.xlarge when we first started investigating the issue. Although we don't know exactly how the dangling sessions came into existence, we consider the issue resolved for now. If we experience similar symptoms in the future, we'll open another issue and reference this one.
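
For anyone hitting similar symptoms, here is a hedged sketch of the kind of query that can surface such dangling sessions. The column names (public_id, termination_reason, create_time) are assumptions based on the Boundary schema, and the 30-day cutoff is arbitrary; verify against your own schema and data before deleting anything:

-- Sessions that were never terminated despite being very old (columns assumed):
SELECT public_id, create_time
  FROM session
 WHERE termination_reason IS NULL
   AND create_time < now() - interval '30 days'
 ORDER BY create_time;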

Thanks @tmessi, @mgaffney, and all the other HashiCorp people involved.