hashicorp / boundary

Boundary enables identity-based access management for dynamic infrastructure.
https://boundaryproject.io
Other
3.84k stars 287 forks source link

Heavy load on PSQL database #1851

Closed emanuelebosetti closed 1 year ago

emanuelebosetti commented 2 years ago

Describe the bug After reaching about 1000 sessions registered in the session table the database becomes very overloaded and slow. This problem occurs both on version 0.7.4 and 0.6.2

To Reproduce Reach a number of ~ 1k in session table.

Expected behavior No heavy load on db

Additional context We have a deployment of 3 controller and about 600 worker geographically distributed. The DB is a Patroni PSQL cluster. I think that the problem is due to the large number of workers trying to check if there are any connections to close. Watching the active query on db it seems that there is slow query blocking the postgres. In detail the following query:

  with
    -- Find connections that are not closed so we can reference those IDs
    unclosed_connections as (
      select connection_id
        from session_connection_state
      where
        -- It's the current state
        end_time is null
          and
        -- Current state isn't closed state
        state in ('authorized', 'connected')
          and
        -- It's not in limbo between when it moved into this state and when
        -- it started being reported by the worker, which is roughly every
        -- 2-3 seconds
        start_time < wt_sub_seconds_from_now(10)
    ),
    connections_to_close as (
      select public_id
        from session_connection
      where
        -- Related to the worker that just reported to us
        server_id = $1
          and
        -- These are connection IDs that just got reported to us by the given
        -- worker, so they should not be considered closed.
        -- Only unclosed ones
        public_id in (select connection_id from unclosed_connections)
    )
    update session_connection
      set
 SELECT * FROM "session_state" WHERE session_id = $1 ORDER BY start_time desc LIMIT 10000

After adding some index, the load on database drastically decrease.

On session_connection table:

create index session_connection__creation_time
    on session_connection (create_time);

create index session_connection__session_id
    on session_connection (session_id);

create index session_connection__server_id
    on session_connection (server_id);

On session_connection_state table:

create index session_connection_state__start_time_state
    on session_connection_state (start_time, state);
the-maldridge commented 2 years ago

I'm running a significantly smaller deployment than you, but I'm also noticing painfully slow performance. Does this manifest in your environment as sluggish CLI/data fetch? Wondering if our two observed failure modes are related.

the-maldridge commented 2 years ago

Any word on this from any of the fine folks at Hashicorp?

tmessi commented 2 years ago

Thanks for this report. We are looking at this query and a few other related queries that were identified during an investigation into a dealock issue and plan on making some changes here.

We are also aware that there is more work to be done to add indexes for performance, see #1881.

the-maldridge commented 2 years ago

Interesting. It seems there is much room to speed up the database. In your opinion, what's the expected time to wait for a connection to init? Is the expectation that using web interfaces like Nomad's which make many many quick requests for backend data over boundary something that's reasonable or is that not really the intended use case for boundary?

giulio1979 commented 2 years ago

I managed to deploy the deadlock fix in my environment (0.7.6 version), the environment is much more stable but the DB is overloaded, the moment I reach 20 concurrent sessions Postgres uses my 16 CPU cores fully (server load is at 25). Scaling it to 1000 sessions looks impossible to me. I can give a shot to your indices, see if it helps

giulio1979 commented 2 years ago

unfortunately didn't help, disabling the WH_ triggers helps, dropped my CPU usage to 16% of a single core and improved bandwidth by a factor of 3

jefferai commented 2 years ago

@giulio1979 Have you had more luck with more recent releases? A lot of scaling work went into 0.8 and 0.9 so it'd be good to see if you see improvement.

covetocove commented 1 year ago

Closing as there's been no issues reported since the perf fixes that landed in 0.8 and 0.9