hashicorp / boundary

Boundary enables identity-based access management for dynamic infrastructure.
https://boundaryproject.io

Performance issues due to untidy session_connection table #3301

Open macmiranda opened 1 year ago

macmiranda commented 1 year ago

Describe the bug

For now very little is known about the problem. psql sessions break with

SSL SYSCALL error: EOF detected

The symptom only arises when a certain data pipeline runs 1000s of tests using dbt against a Postgres instance proxied via Boundary. It uses high concurrency (8 parallel jobs) and each job can create multiple session connections because of the way the underlying libpq parallelizes the queries, so connection churn may be a factor.
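
Connection churn of this kind can be estimated directly from the Boundary database; the query below is only a sketch and assumes session_connection carries the standard create_time column:

-- Rough churn estimate: new session connections created per minute.
select date_trunc('minute', create_time) as minute,
       count(*)                          as new_connections
  from session_connection
 group by 1
 order by 1 desc
 limit 30;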

Another thing we noticed: when that happens, it affects the other sessions on the same worker too, and as far as I can tell it may also affect sessions of users connected to other workers (i.e. in different clusters). It's hard to say for sure, but the dropped connections all seem related, which makes me think something is affecting the controllers.

Other than that, sessions and connections work just fine.

We're running Boundary version 0.12.2 on both controllers and workers, as Kubernetes StatefulSets.

To Reproduce

At a high level:

  1. Connect to a database target
  2. Run dbt with 1000s of tests, with high concurrency, against a single target
  3. The connection breaks after 30s-1m and also breaks other connections on the same worker; reportedly even connections on other workers are affected.

Expected behavior

Stable connection even under heavy load and concurrency.

Additional context

We have manually applied the fixes from https://github.com/hashicorp/boundary/pull/3283 and https://github.com/hashicorp/boundary/pull/3280.

I'm attaching logs from the workers where the machine that runs the data pipeline connects (data-workers.txt) and from the controllers and workers running in a different cluster (controllers+workers.txt). We noticed connection breakage on the workers of this cluster as well. The logs are JSON-formatted (the extension was changed because GitHub didn't accept .json) and span a 2-minute interval in which we observed the issue. They are ordered from most recent to oldest.

I'd be happy to investigate the issue further and run any debugging commands or gather extra logs.

Thanks.

tmessi commented 1 year ago

Thanks for the detailed report. Looking at the logs I see a couple of things. In data-workers.txt there are a lot of errors due to rpc error: code = DeadlineExceeded desc = context deadline exceeded when the worker is trying to report status to the controllers, i.e.:

{
  "id": "Tmf5FUXdWQ",
  "source": "https://hashicorp.com/boundary/boundary-worker-1/worker",
  "specversion": "1.0",
  "type": "error",
  "data": {
    "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded",
    "error_fields": {},
    "id": "e_H1ZibNYS6F",
    "version": "v0.1",
    "op": "worker.(Worker).sendWorkerStatus",
    "info": {
      "msg": "error making status request to controller"
    }
  },
  "datacontentype": "application/cloudevents",
  "time": "2023-06-08T14:57:20.529216848Z"
}

If the worker is unable to report to the controller for too long, it will terminate all sessions and connections on the worker; this can also be seen in the logs:

{
  "id": "MUxazwTLgj",
  "source": "https://hashicorp.com/boundary/boundary-worker-1/worker",
  "specversion": "1.0",
  "type": "error",
  "data": {
    "error": "status error grace period has expired, canceling all sessions on worker",
    "error_fields": {},
    "id": "e_9aQdikjlBe",
    "version": "v0.1",
    "op": "worker.(Worker).sendWorkerStatus",
    "info": {
      "grace_period": 15000000000,
      "last_status_time": "2023-06-08 14:57:56.406564739 +0000 UTC m=+367.954064626"
    }
  },
  "datacontentype": "application/cloudevents",
  "time": "2023-06-08T14:58:17.791085825Z"
}

The rpc error: code = DeadlineExceeded desc = context deadline exceeded error indicates that the status request took too long to respond. So looking at the controller logs to see why it is taking so long, I can see errors on the controller when it is processing these status requests. There are two in particular:

{
  "id": "IrUdOUgOb1",
  "source": "https://hashicorp.com/boundary/boundary-controller-0/controller",
  "specversion": "1.0",
  "type": "error",
  "data": {
    "error": "session.(Repository).checkIfNotActive: db.Query: unknown, unknown: error #0: context deadline exceeded",
    "error_fields": {
      "Code": 0,
      "Msg": "",
      "Op": "session.(Repository).checkIfNotActive",
      "Wrapped": {
        "Code": 0,
        "Msg": "",
        "Op": "db.Query",
        "Wrapped": {}
      }
    },
    "id": "e_6Ri6sVmgzz",
    "version": "v0.1",
    "op": "session.(Repository).checkIfNotActive",
    "request_info": {
      "id": "gtraceid_aiMosV9g4KUbC54jCNjr",
      "method": "/controller.servers.services.v1.ServerCoordinationService/Status"
    }
  },
  "datacontentype": "application/cloudevents",
  "time": "2023-06-08T14:58:45.112276325Z"
}

and

{
  "id": "jdboc37S8o",
  "source": "https://hashicorp.com/boundary/boundary-controller-0/controller",
  "specversion": "1.0",
  "type": "error",
  "data": {
    "error": "session.(Repository).checkIfNotActive: error checking if sessions are no longer active: db.DoTx: unknown, unknown: error #0: dbw.Rollback: sql: transaction has already been committed or rolled back",
    "error_fields": {
      "Code": 0,
      "Msg": "error checking if sessions are no longer active",
      "Op": "session.(Repository).checkIfNotActive",
      "Wrapped": {
        "Code": 0,
        "Msg": "",
        "Op": "db.DoTx",
        "Wrapped": {}
      }
    },
    "id": "e_6Ri6sVmgzz",
    "version": "v0.1",
    "op": "session.(Repository).checkIfNotActive",
    "request_info": {
      "id": "gtraceid_aiMosV9g4KUbC54jCNjr",
      "method": "/controller.servers.services.v1.ServerCoordinationService/Status"
    }
  },
  "datacontentype": "application/cloudevents",
  "time": "2023-06-08T14:58:45.112614501Z"
}

In both cases it looks like the request timed out in the middle of running these queries. It is also worth noting that the closeOrphanedConnections query runs after the checkIfNoLongerActive query as part of processing a worker status request, and it runs even if the first query failed. So if the checkIfNoLongerActive query fails due to a timeout, I would expect to also see a timeout error for the closeOrphanedConnections query.

So this points to load issues on the database and/or the controllers. Would you be able to collect some resource utilization metrics for both the database and controllers? Specifically, for the controllers:

And for the database:

It looks like there are three queries that normally run as part of processing the status report, so the ones of particular interest would be queries like:

UPDATE "session_connection" SET "bytes_down"=$1,"bytes_up"=$2,"closed_reason"=$3 WHERE "public_id" = $4
select session_id, state
    from session_state ss
where
    (ss.state = 'canceling' or ss.state = 'terminated')
    and ss.end_time is null
-- Find connections that are not closed so we can reference those IDs
with
  unclosed_connections as (
    select connection_id
      from session_connection_state
    where
      -- It's the current state
      end_time is null
      -- Current state isn't closed state
      and state in ('authorized', 'connected')
      -- It's not in limbo between when it moved into this state and when
      -- it started being reported by the worker, which is roughly every
      -- 2-3 seconds
      and start_time < wt_sub_seconds_from_now(@worker_state_delay_seconds)
  ),
  connections_to_close as (
    select public_id
      from session_connection
     where
           -- Related to the worker that just reported to us
           worker_id = @worker_id
           -- Only unclosed ones
           and public_id in (select connection_id from unclosed_connections)
           -- These are connection IDs that just got reported to us by the given
           -- worker, so they should not be considered closed.
  )
update session_connection
   set
      closed_reason = 'system error'
 where
    public_id in (select public_id from connections_to_close)
returning public_id;

(This last one looks like one that we looked at as part of #3283)
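
One way to see whether this last query is what is slow on a bloated session_connection_state table is to time its inner lookup in isolation; the sketch below replaces the wt_sub_seconds_from_now(@worker_state_delay_seconds) helper with a literal interval purely for illustration:

-- Sketch: time the unclosed-connections lookup on its own.
explain (analyze, buffers)
select connection_id
  from session_connection_state
 where end_time is null
   and state in ('authorized', 'connected')
   and start_time < now() - interval '10 seconds';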

But it is also possible that some other query might be contributing to the load issue, so if there are any other slow queries that would also be helpful.
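
Slow queries can also be surfaced from Postgres itself, for example with the pg_stat_statements extension if it is enabled; this is only a sketch, and the column names below are the PostgreSQL 13+ ones:

-- Top statements by mean execution time (requires pg_stat_statements;
-- on Postgres 12 and older the columns are mean_time/total_time instead).
select calls,
       round(mean_exec_time::numeric, 2)  as mean_ms,
       round(total_exec_time::numeric, 2) as total_ms,
       left(query, 120)                   as query
  from pg_stat_statements
 order by mean_exec_time desc
 limit 20;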

macmiranda commented 1 year ago

Hi @tmessi,

Thanks for the thorough explanation. It all makes sense now. Here's what happened since the last time I wrote:

We hadn't realized until we ran the same pipelines through Boundary again yesterday that the issue was gone. Based on what you described above and the symptoms before and after this change, we are confident enough to say that the massive number of rows in session_connection was the culprit.
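
The size of the tables involved is easy to check; the queries below are a sketch of the kind of sanity check that shows how untidy session_connection has become:

-- Row counts of the tables touched while processing worker status reports.
select count(*) as session_connection_rows       from session_connection;
select count(*) as session_connection_state_rows from session_connection_state;
select count(*) as session_rows                  from session;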

Now, while the worst has passed, a few things still intrigue me:

macmiranda commented 1 year ago

Found a similar closed issue, not sure if that may play a part here.

@irenarindos

tmessi commented 1 year ago

It does seem like some sessions are getting stuck in the canceling state, which would prevent those sessions and their session connections from being automatically deleted, since only sessions that are terminated get deleted.

The controller should be running a query periodically to find any session that is either:

Can you check the controller logs for any of the following error messages that might indicate that this query is failing:

This last one would be expected during a normal shutdown, but might indicate an issue if it happens on a controller that is still running.
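
Based on the tables shown elsewhere in this thread, a rough way to spot sessions that such a cleanup job should be picking up is a sketch like:

-- Sessions that look overdue for termination: past their expiration time
-- but whose current state is not yet 'terminated'.
select s.public_id, s.expiration_time, ss.state
  from session as s
  join session_state as ss on s.public_id = ss.session_id
 where ss.end_time is null
   and ss.state != 'terminated'
   and s.expiration_time < now();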

macmiranda commented 1 year ago

We had 1 occurrence of the following error in the last 7 days:

{
  "id": "3tPyXCDpyF",
  "source": "https://hashicorp.com/boundary/boundary-controller-1/controller",
  "specversion": "1.0",
  "type": "error",
  "data": {
    "error": "session.(Repository).TerminateCompletedSessions: db.DoTx: unknown, unknown: error #0: dbw.Begin: failed to connect to `host=boundary-postgres.internal.[REDACTED].com user=boundary_app database=boundary`: dial error (timeout: dial tcp 10.21.241.165:5432: connect: connection timed out)",
    "error_fields": {
      "Code": 0,
      "Msg": "",
      "Op": "session.(Repository).TerminateCompletedSessions",
      "Wrapped": {
        "Code": 0,
        "Msg": "",
        "Op": "db.DoTx",
        "Wrapped": {}
      }
    },
    "id": "e_Wi0AOmmRrJ",
    "version": "v0.1",
    "op": "controller.(Controller).startTerminateCompletedSessionsTicking",
    "info": {
      "msg": "error performing termination of completed sessions"
    }
  },
  "datacontentype": "application/cloudevents",
  "time": "2023-06-08T17:35:10.095189613Z"
}

(seems to point to a temporary loss of connectivity)

No occurrences of error fetching repository for terminating completed sessions

No occurrences of terminating completed sessions ticking shutting down

Now, looking further into the 3 cases for session cleanup not working:

In fact, by running the following query

select session_id
  from session_connection
 where public_id in (
        select connection_id
          from session_connection_state
         where state != 'closed'
           and end_time is null
       )

we get

session_id
s_XxQTY48IPy
s_NnzbgdEzfn
s_Q1wA3hKiaZ
s_bLnLrQyYjZ
s_vnxMjNRXfI
s_bLnLrQyYjZ
s_XxQTY48IPy
s_pD1b6GA9KN
s_vnxMjNRXfI

The session IDs in bold are the ones mentioned in my last message. So the question now is rather: how can connections still be considered active long after the session was set to expire? Does it really make sense to wait (and if so, for how long)?

And just an afterthought: is the unknown closed_reason something I should worry about? Under what circumstances would that happen? Sorry to ask, I would check the source code if I had more time to investigate.

tmessi commented 1 year ago

I don't think the unknown reason is related, nor cause for concern. It seems that's just what is always sent as the reason when the worker reports that the connection is closed. This is probably a separate bug, since I would expect closed by end-user. If you don't mind opening a separate bug for that so it can more easily be tracked, that would be helpful.
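
If it helps with that separate report, the spread of closed_reason values can be checked with a simple aggregate (a sketch):

-- How widespread the 'unknown' closed_reason is, compared to other reasons.
select closed_reason, count(*)
  from session_connection
 group by closed_reason
 order by count(*) desc;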

So the question now is rather: how can connections still be considered active long after the session was set to expire?

That is unexpected. Would you mind running the following query to get some additional data about the sessions in this state?

with
canceling_sessions as (
        select s.public_id as session_id, s.expiration_time, s.connection_limit
          from session as s
          join session_state as ss on s.public_id = ss.session_id
         where ss.state = 'canceling'
           and ss.end_time is null
),
closed_connections as (
         select public_id, session_id, worker_id
           from session_connection sc
           join session_connection_state scs on sc.public_id = scs.connection_id
          where end_time is null
            and scs.state = 'closed'
),
open_connections as (
         select public_id, session_id, worker_id
           from session_connection sc
           join session_connection_state scs on sc.public_id = scs.connection_id
          where end_time is null
            and scs.state != 'closed'
)
         select cs.session_id, cs.expiration_time < now() as expired,
                cs.connection_limit, count(open.public_id) as open_connections,
                count(closed.public_id) as closed_connections,
                open.worker_id as open_worker_id,
                closed.worker_id as closed_worker_id
           from canceling_sessions as cs
left outer join closed_connections as closed on cs.session_id = closed.session_id
left outer join open_connections   as open   on cs.session_id = open.session_id
       group by cs.session_id, cs.expiration_time, cs.connection_limit, closed.worker_id, open.worker_id
       order by cs.expiration_time desc;

And for any of the worker_ids returned that have open connections and an expired session, would you be able to check their logs for any mention of the corresponding session IDs?

macmiranda commented 1 year ago

The query returned

session_id    expired  connection_limit  open_connections  closed_connections  open_worker_id  closed_worker_id
s_pD1b6GA9KN  TRUE     -1                1062              1062                w_pv9puJL6lT    w_sy0swTMhaT
s_pD1b6GA9KN  TRUE     -1                899               899                 w_pv9puJL6lT    w_pv9puJL6lT
s_bLnLrQyYjZ  TRUE     -1                864               864                 w_sy0swTMhaT
s_bLnLrQyYjZ  TRUE     -1                4002              4002                w_sy0swTMhaT    w_sy0swTMhaT
s_vnxMjNRXfI  TRUE     -1                1988              1988                w_sy0swTMhaT
s_vnxMjNRXfI  TRUE     -1                1076              1076                w_sy0swTMhaT    w_sy0swTMhaT
s_XxQTY48IPy  TRUE     -1                718               718                 w_R2hr2pbbhx    w_R2hr2pbbhx
s_XxQTY48IPy  TRUE     -1                718               718                 w_jfXZ7BVjIM    w_R2hr2pbbhx
s_XxQTY48IPy  TRUE     -1                721               721                 w_jfXZ7BVjIM    w_gdEpHsDfNw
s_XxQTY48IPy  TRUE     -1                721               721                 w_R2hr2pbbhx    w_gdEpHsDfNw
s_XxQTY48IPy  TRUE     -1                726               726                 w_jfXZ7BVjIM    w_jfXZ7BVjIM
s_XxQTY48IPy  TRUE     -1                726               726                 w_R2hr2pbbhx    w_jfXZ7BVjIM

So I checked logs for s_XxQTY48IPy and s_bLnLrQyYjZ and checked for errors around the same time in the same cluster/namespace including for other workers (respectively s_XxQTY48IPy-errors.txt and s_bLnLrQyYjZ-errors.txt).

Is it possible that the worker terminated the connection due to status grace period expiration but, because it wasn't able to immediately send that information to the controller, the controller and worker states diverged? Would the worker and controller reconcile all active and terminated connections once the communication is restored? Is there any sort of acknowledgement when that happens, or is it on a best-effort basis?

tmessi commented 1 year ago

Would the worker and controller reconcile all active and terminated connections once the communication is restored? Is there any sort of acknowledgement when that happens or is it on a best-effort basis?

If the worker is able to reconnect, things should reconcile. And if the worker is down for too long, the controller should mark all the connections as closed, but it appears that something is not reconciling as expected here. Did these workers eventually reconnect? The last thing I see in both logs is an error making status request to controller. You can check with:

select public_id, last_status_time, operational_state
  from server_worker;

And those previous results look a little odd, can you try this query as well:

  select s.public_id, count(*), sc.worker_id, scs.state
    from session as s
    join session_connection as sc on s.public_id = sc.session_id
    join session_connection_state as scs on sc.public_id = scs.connection_id
   where scs.end_time is null
     and s.public_id in ('s_XxQTY48IPy', 's_vnxMjNRXfI', 's_bLnLrQyYjZ', 's_pD1b6GA9KN')
group by (s.public_id, sc.worker_id, scs.state);

macmiranda commented 1 year ago

Did these workers eventually reconnect?

Yes, to my surprise, they are actually active. I thought that, because we run workers on Kubernetes, they would have been replaced during an upgrade, but these are indeed still active.

public_id last_status_time operational_state
w_pv9puJL6lT 2023-06-09 07:49:41.125531+00 shutdown
w_zBk6zmPQyN 2023-04-25 13:15:50.685821+00 shutdown
w_Q0NJVW1opt 2023-04-25 13:15:51.372905+00 shutdown
w_Pxx5LR88cz 2023-05-24 08:43:39.203868+00 shutdown
w_r1VICDA7Gc 2023-06-15 17:49:37.920326+00 active
w_iry3DfV7Uk 2023-06-15 17:49:38.166853+00 active
w_sy0swTMhaT 2023-06-15 17:49:38.472904+00 active
w_EEFoTpGPWq 2023-06-15 17:49:38.918738+00 active
w_aH0U367xh8 2023-06-15 17:49:39.455784+00 active
w_jfXZ7BVjIM 2023-06-15 17:49:39.724975+00 active
w_zoZg8pwrjw 2023-06-15 17:49:37.556514+00 active
w_xXjkQ05Etg 2023-06-15 17:49:38.438825+00 active
w_Abv8I5UcIq 2023-06-15 17:49:38.972074+00 active
w_gdEpHsDfNw 2023-06-15 17:49:39.678872+00 active
w_ZoOsCY8yhM 2023-05-03 10:42:50.29001+00 shutdown
w_U1fdrsXWxU 2023-06-15 17:49:37.986251+00 active
w_Sk8AgZ1dgc 2023-06-15 17:49:38.38113+00 active
w_fVQo4gewAr 2023-06-15 17:49:39.359341+00 active
w_yVFcvg9B2b 2023-05-03 10:42:55.998853+00 shutdown
w_R2hr2pbbhx 2023-06-15 17:49:38.779715+00 active

The second query returns

public_id     count  worker_id     state
s_bLnLrQyYjZ  2001   w_sy0swTMhaT  closed
s_bLnLrQyYjZ  2      w_sy0swTMhaT  connected
s_bLnLrQyYjZ  432                  closed
s_pD1b6GA9KN  899    w_pv9puJL6lT  closed
s_pD1b6GA9KN  1      w_pv9puJL6lT  connected
s_pD1b6GA9KN  1062   w_sy0swTMhaT  closed
s_vnxMjNRXfI  538    w_sy0swTMhaT  closed
s_vnxMjNRXfI  2      w_sy0swTMhaT  connected
s_vnxMjNRXfI  994                  closed
s_XxQTY48IPy  721    w_gdEpHsDfNw  closed
s_XxQTY48IPy  726    w_jfXZ7BVjIM  closed
s_XxQTY48IPy  1      w_jfXZ7BVjIM  connected
s_XxQTY48IPy  718    w_R2hr2pbbhx  closed
s_XxQTY48IPy  1      w_R2hr2pbbhx  connected
macmiranda commented 1 year ago

So I thought the connections with state connected could be of special interest

public_id worker_id connection_id state
s_bLnLrQyYjZ w_sy0swTMhaT sc_8mNjmQZgLA connected
s_bLnLrQyYjZ w_sy0swTMhaT sc_M119RdYaku connected
s_pD1b6GA9KN w_pv9puJL6lT sc_Fj3UaEvTtS connected
s_vnxMjNRXfI w_sy0swTMhaT sc_Cj6zbEvrdw connected
s_vnxMjNRXfI w_sy0swTMhaT sc_ttyQDy1qCe connected
s_XxQTY48IPy w_jfXZ7BVjIM sc_tJO4CA3KAi connected
s_XxQTY48IPy w_R2hr2pbbhx sc_pIbwhteeq4 connected

and checked the logs for sc_8mNjmQZgLA as well as 10 log lines before and after the first time terminated connection due to status grace period expiration is logged.
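
For reference, a query along these lines would produce a table like the one above; this is a sketch, not necessarily the exact query that was run:

-- Connections whose current (end_time is null) state is still 'connected'
-- for the four sessions in question.
  select sc.session_id as public_id, sc.worker_id,
         sc.public_id as connection_id, scs.state
    from session_connection as sc
    join session_connection_state as scs on sc.public_id = scs.connection_id
   where scs.end_time is null
     and scs.state = 'connected'
     and sc.session_id in ('s_XxQTY48IPy', 's_vnxMjNRXfI', 's_bLnLrQyYjZ', 's_pD1b6GA9KN')
order by sc.session_id;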

macmiranda commented 1 year ago

Hi @tmessi,

Is there anything else I can check that would help you understand this issue? If not, I'm thinking of manually removing those sessions again. I haven't noticed any new sessions, besides those 4, that haven't been cleaned up on time (up until now).

AdamBouhmad commented 1 year ago

Hey @macmiranda, sorry for the delay. Tim is out on vaca but we're looking into this. Thanks for your patience ^^

tmessi commented 1 year ago

@macmiranda I think I have identified a scenario that could result in session connections being left in the connected state in the database. It would require a specific timing of events in which the worker is unable to report to the controller and then resumes communication. While you were experiencing the issues with #3281, I could see this scenario easily happening. However, I would expect that after resolving those issues, this would be a very rare scenario. Over the last few weeks, are you still seeing old connections in the connected state? And over that same time period have you seen errors with the workers timing out when trying to send status?

Meanwhile I am still working on trying to recreate this to see if this scenario I have in mind is possible, and looking to see how best to address it.

macmiranda commented 1 year ago

Hi @tmessi,

It would require a specific timing of events in which the worker is unable to report to the controller and then resumes communication. While you were experiencing the issues with #3281, I could see this scenario easily happening.

Yes, it seems like that was the issue.

We have not seen any other dangling session connections since my last message. I'll go ahead and remove those 4 expired sessions manually, which should remove the session connections as well.
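
For what it's worth, the manual removal could look roughly like the sketch below; it assumes the session_connection and session_connection_state rows are removed via foreign-key cascades when the session rows are deleted (as implied above), so verify against the schema and take a backup before running anything like it:

-- Sketch only: delete the four expired sessions that were never cleaned up.
delete from session
 where public_id in ('s_XxQTY48IPy', 's_vnxMjNRXfI', 's_bLnLrQyYjZ', 's_pD1b6GA9KN');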

As far as the logs go, I wasn't able to check them since we have no logs after upgrading to v0.13.0. Probably some change related to the default logging level... anyway, I will check once we figure that out, but I don't expect to see status report errors anymore.

Thanks for your help!

macmiranda commented 1 year ago

Just to let you know, we are still seeing sendWorkerStatus errors but it seems the worker is handling it well, i.e. canceling the sessions, which I can see in the database with status canceled.

2023-07-10T01:07:21+02:00   {"id":"kdekmqmHu6","source":"https://hashicorp.com/boundary/boundary-worker-1/worker","specversion":"1.0","type":"error","data":{"error":"status error grace period has expired, canceling all sessions on worker","error_fields":{},"id":"e_eBrnba45R5","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"grace_period":15000000000,"last_status_time":"2023-07-09 23:02:53.542054 +0000 UTC m=+193175.319334184"}},"datacontentype":"application/cloudevents","time":"2023-07-09T23:07:20.900761366Z"}
2023-07-10T01:07:20+02:00   {"id":"9WrT5DNy7E","source":"https://hashicorp.com/boundary/boundary-worker-1/worker","specversion":"1.0","type":"error","data":{"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","error_fields":{},"id":"e_qa075SKWcn","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"msg":"error making status request to controller"}},"datacontentype":"application/cloudevents","time":"2023-07-09T23:03:12.437104648Z"}
2023-07-09T22:07:17+02:00   {"id":"ZD3ShMPy6s","source":"https://hashicorp.com/boundary/boundary-worker-1/worker","specversion":"1.0","type":"error","data":{"error":"status error grace period has expired, canceling all sessions on worker","error_fields":{},"id":"e_7wgYedThQu","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"grace_period":15000000000,"last_status_time":"2023-07-09 20:03:15.961878223 +0000 UTC m=+182397.739158399"}},"datacontentype":"application/cloudevents","time":"2023-07-09T20:04:17.897453496Z"}
2023-07-09T22:04:00+02:00   {"id":"Fcl7dnFa3g","source":"https://hashicorp.com/boundary/boundary-worker-1/worker","specversion":"1.0","type":"error","data":{"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","error_fields":{},"id":"e_pBYHUHlny5","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"msg":"error making status request to controller"}},"datacontentype":"application/cloudevents","time":"2023-07-09T20:03:31.31223578Z"}

Still not sure what may be causing these timeouts but I think we need to do some investigation on our end first.
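
One way to narrow the timeouts down is to watch the database for long-running statements while they occur; the sketch below uses the standard pg_stat_activity view:

-- Statements that have been running (or waiting) for more than 5 seconds.
select pid,
       now() - query_start as duration,
       state,
       wait_event_type,
       wait_event,
       left(query, 120) as query
  from pg_stat_activity
 where state != 'idle'
   and query_start < now() - interval '5 seconds'
 order by query_start;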