Closed emanuelebosetti closed 1 year ago
I'm running a significantly smaller deployment than you, but I'm also noticing painfully slow performance. Does this manifest in your environment as sluggish CLI/data fetch? Wondering if our two observed failure modes are related.
Any word on this from any of the fine folks at HashiCorp?
Thanks for this report. We are looking at this query and a few other related queries that were identified during an investigation into a deadlock issue, and we plan on making some changes here.
We are also aware that there is more work to be done to add indexes for performance, see #1881.
Interesting. It seems there is a lot of room to speed up the database. In your opinion, how long should a connection take to initialize? Is it reasonable to use web interfaces like Nomad's, which make many quick requests for backend data over Boundary, or is that not really the intended use case for Boundary?
I managed to deploy the deadlock fix in my environment (version 0.7.6). The environment is much more stable, but the DB is overloaded: the moment I reach 20 concurrent sessions, Postgres fully saturates my 16 CPU cores (server load is at 25). Scaling to 1000 sessions looks impossible to me. I can give your indexes a shot and see if they help.
Unfortunately the indexes didn't help. Disabling the WH_ triggers did: CPU usage dropped to 16% of a single core and bandwidth improved by a factor of 3.
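For readers trying to reproduce this, the comment above doesn't list the exact triggers involved. A minimal sketch for locating triggers with the WH_ prefix in Postgres follows; the example table and trigger names in the comment are hypothetical, not taken from this thread, and disabling such triggers would also stop whatever data they maintain from being updated:

```sql
-- List non-internal triggers whose names start with wh_
-- (note the escaped underscore in the LIKE pattern).
SELECT tgrelid::regclass AS table_name, tgname
FROM pg_trigger
WHERE NOT tgisinternal
  AND tgname LIKE 'wh\_%'
ORDER BY 1, 2;

-- Disabling one would look like this (names here are hypothetical):
-- ALTER TABLE session_connection DISABLE TRIGGER wh_insert_session_connection;
```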
@giulio1979 Have you had more luck with more recent releases? A lot of scaling work went into 0.8 and 0.9 so it'd be good to see if you see improvement.
Closing, as no issues have been reported since the perf fixes that landed in 0.8 and 0.9.
Describe the bug
After reaching about 1000 sessions registered in the session table, the database becomes very overloaded and slow. This problem occurs on both version 0.7.4 and version 0.6.2.
To Reproduce
Reach roughly 1k rows in the session table.
Expected behavior
No heavy load on the DB.
Additional context
We have a deployment of 3 controllers and about 600 workers, geographically distributed. The DB is a Patroni PostgreSQL cluster. I think the problem is due to the large number of workers checking whether there are any connections to close. Watching the active queries on the DB, it seems a slow query is blocking Postgres. In detail, the following query:
After adding some indexes, the load on the database drastically decreased.
On the session_connection table:
On the session_connection_state table:
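The exact index definitions were not captured in this thread. A plausible sketch, assuming the cleanup queries filter on the foreign-key columns session_id and connection_id (an assumption based on the table names, not on the poster's actual definitions), might look like:

```sql
-- Hypothetical index definitions; the column choices are assumptions.
CREATE INDEX CONCURRENTLY session_connection_session_id_ix
    ON session_connection (session_id);

CREATE INDEX CONCURRENTLY session_connection_state_connection_id_ix
    ON session_connection_state (connection_id);
```

CONCURRENTLY avoids holding a long write lock on the table while the index builds, which matters on a loaded production cluster like the one described here.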