Closed macmiranda closed 1 year ago
Reopening this for now. The index created in #3283 helps with some of the queries, but there are some others that need to be followed up on.
Short update: we were able to reduce the load on the DB even further, after applying the 2 linked PR fixes, by dropping old sessions from the session table that weren't connected to any existing project or scope. In fact, we just looked for really old sessions (that should have been closed anyway because of their max TTL, but weren't) that didn't show up when running `boundary sessions list -scope-id global -recursive` as an admin.
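For reference, the kind of cleanup we ran can be sketched roughly as follows. This is only a sketch: the column names (`public_id`, `expiration_time`, `termination_reason`) are assumptions based on our reading of the schema, not a vetted script, so review the result set and take a backup before deleting anything.

```sql
-- Sketch only: identify sessions whose TTL expired long ago but which were
-- never marked terminated. Verify column names against your Boundary schema.
SELECT public_id, expiration_time
  FROM session
 WHERE expiration_time < now() - interval '30 days'
   AND termination_reason IS NULL;

-- After reviewing the rows (and backing up), the same predicate could drive a
-- DELETE. Cascading foreign keys should clean up the related
-- session_connection rows, but verify that before relying on it.
```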
The Boundary database now runs on a `db.t4g.small` with ~10% average load, compared to ~27% on a `db.t4g.xlarge` when we first started investigating the issue.
Despite not knowing how exactly the dangling sessions came into existence, we feel the issue can be considered resolved for now. If we experience similar symptoms in the future, we'll open another issue and reference this one.
Thanks @tmessi, @mgaffney and all other HashiCorp people involved.
Describe the bug (as originally posted on HashiCorp Discuss)
Hi community,
We’ve been consistently experiencing high load on our Boundary RDS PostgreSQL database. The impact is clearly seen when a user opens Boundary Desktop and has to wait 7 to 10 seconds to see a list of Targets they can connect to, or a list of Sessions that are active/pending. Authorizing a new session also takes a couple of seconds, whereas it used to be immediate during the implementation phase. These waiting times are less noticeable when using the Boundary CLI. (Update: while using `curl` to list Targets, the response time was within an acceptable range of 0.4-0.5s; listing Sessions, though, took between 8 and 9 seconds.)

The issue is somewhat similar to the one described here, supposedly fixed by enhancements in `0.8.x` and `0.9.x`. We’re running `0.12.2`.

The DB instance is a `db.t4g.xlarge` with 4 vCPUs and 16GB RAM, which I would assume to be reasonably sized for a production deployment, based on the reference architecture that uses a `db.t3.micro`.

Here you can see some stats of the top SQL commands by waits (the top query is the same one pointed out in the GitHub issue above) and performance metrics of the instance (notice 1 vCPU is consistently at max utilization, i.e. ~25% of the total).
We have 3 controllers running in a single Kubernetes cluster. The workers run on 6 different clusters (3 per cluster, with a few exceptions). Both controllers and workers are exposed using Kubernetes Ingress (workers leverage an `ingress-nginx` feature to expose TCP ports).

At any given time no more than 20 Boundary sessions are active across all clusters (since Boundary adoption is happening slowly). This was verified with the following SQL query (not sure if my interpretation of the result is correct though):
Although the number of sessions is quite small, the number of rows in the `session_connection` table is quite high (160k rows), and in `session_connection_state` it is 3x as much (understandably, since it tracks 3 different states for each connection).

Some questions I have:
- `closed_reason` shows either `unknown` or `system error`, which I find quite strange.
- Would truncating the `session_connection` table reduce the load on the server? What other alternatives are there? What are the downsides of doing so besides the loss of historical data?

To Reproduce
May be hard to reproduce in-lab, as the issue may be related to the cardinality of `session_connection`.
No sustained database load at 25%. Response time when listing targets and sessions under a second (though they may not be related to the same issue).
Additional context
After further testing with HashiCorp engineers, we noticed the average CPU utilization of the DB instance was reduced from ~25% to ~15% after the creation of the following index:
Session-listing times weren't affected by this change and remained in the range between 8 and 9 seconds (for an admin with access to all Targets).
We also noticed that, due to the type of application and protocol being proxied through Boundary, some sessions end up having a lot of "child" connections:
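For anyone wanting to check this on their own deployment, a query along these lines shows the connection count per session. This is a sketch only; the `session_id` foreign-key column name is an assumption and may differ across Boundary versions.

```sql
-- Sketch: count "child" connections per session, highest first.
-- Assumes session_connection carries a session_id column; verify first.
SELECT session_id, count(*) AS connection_count
  FROM session_connection
 GROUP BY session_id
 ORDER BY connection_count DESC
 LIMIT 20;
```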