Suboptimal Boundary database performance #3281

Closed: macmiranda closed this issue 1 year ago

macmiranda commented 1 year ago

Describe the bug (as originally posted on HashiCorp Discuss)

Hi community,

We’ve been consistently experiencing high load on our Boundary RDS PostgreSQL database. The impact is most visible when a user opens Boundary Desktop and has to wait 7 to 10 seconds to see the list of Targets they can connect to or the list of Sessions that are active/pending. Authorizing a new session also takes a couple of seconds, whereas it used to be immediate during the implementation phase. These waiting times are less noticeable when using the Boundary CLI. (Update: when using curl to list Targets, the response time was within an acceptable range of 0.4-0.5s; listing Sessions, though, took between 8 and 9 seconds.)

The issue is somewhat similar to the one described here, supposedly fixed by enhancements in 0.8.x and 0.9.x. We’re running 0.12.2.

The DB instance is a db.t4g.xlarge with 4 vCPUs and 16GB RAM, which I would assume to be reasonably sized for a production deployment based on the reference architecture that uses a db.t3.micro.

Here you can see some stats of the top SQL commands by waits (the top query is the same one pointed out in the GitHub issue above): [screenshot: stats]

and performance metrics for the instance (notice that 1 vCPU is consistently at max utilization, i.e. ~25% of the total): [screenshot: metrics]

We have 3 controllers running in a single Kubernetes cluster. The workers run on 6 different clusters (3 per cluster, with a few exceptions). Both controllers and workers are exposed using Kubernetes Ingress (workers leverage an ingress-nginx feature to expose TCP ports).

At any given time, no more than 20 Boundary sessions are active across all clusters (since Boundary adoption is happening slowly). This was verified with the following SQL query (though I'm not sure my interpretation of the result is correct):

SELECT * FROM "public"."session_state"  WHERE ("state" = 'active') AND ("end_time" IS NULL)
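
For comparison, a hedged variant that counts distinct active sessions rather than returning raw state rows; it assumes session_state carries a session_id column, as in the Boundary 0.12.x schema:

SELECT count(DISTINCT "session_id")
  FROM "public"."session_state"
 WHERE "state" = 'active'
   AND "end_time" IS NULL;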

Although the number of sessions is quite small, the number of rows in the session_connection table is quite high (160k), and session_connection_state holds roughly three times as many (understandably, since it tracks 3 different states for each connection).
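
A quick way to sanity-check those cardinalities (the singular table names below match the index statement further down; adjust if your schema differs):

SELECT
  (SELECT count(*) FROM session_connection)       AS connection_rows,
  (SELECT count(*) FROM session_connection_state) AS connection_state_rows;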

Some questions I have:

To Reproduce

This may be hard to reproduce in a lab, as the issue may be related to the cardinality of session_connection.

Expected behavior

No sustained database load at 25%, and response times under a second when listing targets and sessions (though these may not be related to the same issue).

Additional context

After further testing with HashiCorp engineers, we noticed that the average CPU utilization of the DB instance dropped from ~25% to ~15% after the creation of the following index:

CREATE INDEX session_connection_worker_id_closed_reason_ix
    ON session_connection (worker_id, closed_reason);

[screenshot: after_index]
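
To check whether the planner actually picks up the new index, one can run an EXPLAIN over a representative filter; the worker id below is made up:

EXPLAIN ANALYZE
SELECT count(*)
  FROM session_connection
 WHERE worker_id = 'w_1234567890'
   AND closed_reason IS NULL;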

Session-listing times weren't affected by this change and remained in the range between 8 and 9 seconds (for an admin with access to all Targets).

We also noticed that, due to the type of application and protocol being proxied through Boundary, some sessions end up having a lot of "child" connections (counts per session below; see the query sketch after the table):

connection_count    session_id
70065 s_9a0SwaVQCg
48917 s_8CkGLFdRcc
26352 s_yijVTcLzay
11630 s_sEx10tttB2
5199 s_F5g0UKDvJU
2674 s_FjQEVDdIQy
2010 s_t4aym74hI0
1394 s_e7Xv80ML8o
1291 s_drJvt6vmf0
1026 s_2mTziw0ZjZ
457 s_BF3RcTzwbh
453 s_kMMBo4hvnm
192 s_dHwnZcsAL1
73 s_zDePrYWCxe
59 s_b2IzH36QX1
29 s_ejWlUr9snm
20 s_hHC9zUhIo9
18 s_bL9n4kyCXL
14 s_8GDqI1HkfR
12 s_u3kKyfKgZ1
11 s_BvDCtwmH0y
10 s_NzXmP9hljv
10 s_0R32kh9078
9 s_XwtjPNsJJr
8 s_8NUlPF6CG2
7 s_FXdW77rvXU
3 s_CDPTJ4AKJG
2 s_iOThqDP1rf
2 s_NCa0pfi7bl
2 s_guv25zIa6u
2 s_7lFUoV3vWO
1 s_USd522f6m5
1 s_qhSujJTedE
1 s_2m6TuxF8UL
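
For reference, a sketch of the aggregation that could produce a table like the one above, using the same session_connection table as the index statement:

SELECT count(*) AS connection_count, session_id
  FROM session_connection
 GROUP BY session_id
 ORDER BY connection_count DESC;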
tmessi commented 1 year ago

Reopening this for now. The index created in #3283 helps with some of the queries, but there are some others that need to be followed up on.

macmiranda commented 1 year ago

Short update: after applying the two linked PR fixes, we were able to reduce the load on the DB even further by dropping old sessions from the session table that weren't connected to any existing project or scope. In fact, we just looked for really old sessions (which should have been closed anyway because of their max TTL, but weren't) that didn't show up when running boundary sessions list -scope-id global -recursive as an admin. The Boundary database now runs on a db.t4g.small with ~10% average load, compared to ~27% on a db.t4g.xlarge when we first started investigating the issue. Although we don't know exactly how the dangling sessions came into existence, we consider the issue resolved for now. If we experience similar symptoms in the future, we'll open another issue and reference this one.
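
For anyone hitting similar symptoms, here is a hedged sketch of the kind of query that can surface such dangling sessions. The column names (public_id, termination_reason, create_time) are assumptions based on the Boundary schema, and the 30-day cutoff is arbitrary; verify against your own schema and data before deleting anything:

-- Sessions that were never terminated despite being very old (columns assumed):
SELECT public_id, create_time
  FROM session
 WHERE termination_reason IS NULL
   AND create_time < now() - interval '30 days'
 ORDER BY create_time;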

Thanks @tmessi, @mgaffney, and all the other HashiCorp people involved.