macmiranda opened this issue 1 year ago (status: Open)
Thanks for the detailed report. Looking at the logs I see a couple of things. In data-workers.txt there are a lot of errors due to rpc error: code = DeadlineExceeded desc = context deadline exceeded when the worker is trying to report status to the controllers, i.e.:
{
"id": "Tmf5FUXdWQ",
"source": "https://hashicorp.com/boundary/boundary-worker-1/worker",
"specversion": "1.0",
"type": "error",
"data": {
"error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded",
"error_fields": {},
"id": "e_H1ZibNYS6F",
"version": "v0.1",
"op": "worker.(Worker).sendWorkerStatus",
"info": {
"msg": "error making status request to controller"
}
},
"datacontentype": "application/cloudevents",
"time": "2023-06-08T14:57:20.529216848Z"
}
If the worker is unable to report to the controller for too long, it will terminate all sessions and connections on the worker. This can also be seen in the logs:
{
"id": "MUxazwTLgj",
"source": "https://hashicorp.com/boundary/boundary-worker-1/worker",
"specversion": "1.0",
"type": "error",
"data": {
"error": "status error grace period has expired, canceling all sessions on worker",
"error_fields": {},
"id": "e_9aQdikjlBe",
"version": "v0.1",
"op": "worker.(Worker).sendWorkerStatus",
"info": {
"grace_period": 15000000000,
"last_status_time": "2023-06-08 14:57:56.406564739 +0000 UTC m=+367.954064626"
}
},
"datacontentype": "application/cloudevents",
"time": "2023-06-08T14:58:17.791085825Z"
}
The rpc error: code = DeadlineExceeded desc = context deadline exceeded error indicates that the status request took too long to get a response. So, looking at the controller logs to see why it is taking so long, I can see errors on the controller while it is processing these status requests. Two in particular stand out:
{
"id": "IrUdOUgOb1",
"source": "https://hashicorp.com/boundary/boundary-controller-0/controller",
"specversion": "1.0",
"type": "error",
"data": {
"error": "session.(Repository).checkIfNotActive: db.Query: unknown, unknown: error #0: context deadline exceeded",
"error_fields": {
"Code": 0,
"Msg": "",
"Op": "session.(Repository).checkIfNotActive",
"Wrapped": {
"Code": 0,
"Msg": "",
"Op": "db.Query",
"Wrapped": {}
}
},
"id": "e_6Ri6sVmgzz",
"version": "v0.1",
"op": "session.(Repository).checkIfNotActive",
"request_info": {
"id": "gtraceid_aiMosV9g4KUbC54jCNjr",
"method": "/controller.servers.services.v1.ServerCoordinationService/Status"
}
},
"datacontentype": "application/cloudevents",
"time": "2023-06-08T14:58:45.112276325Z"
}
and
{
"id": "jdboc37S8o",
"source": "https://hashicorp.com/boundary/boundary-controller-0/controller",
"specversion": "1.0",
"type": "error",
"data": {
"error": "session.(Repository).checkIfNotActive: error checking if sessions are no longer active: db.DoTx: unknown, unknown: error #0: dbw.Rollback: sql: transaction has already been committed or rolled back",
"error_fields": {
"Code": 0,
"Msg": "error checking if sessions are no longer active",
"Op": "session.(Repository).checkIfNotActive",
"Wrapped": {
"Code": 0,
"Msg": "",
"Op": "db.DoTx",
"Wrapped": {}
}
},
"id": "e_6Ri6sVmgzz",
"version": "v0.1",
"op": "session.(Repository).checkIfNotActive",
"request_info": {
"id": "gtraceid_aiMosV9g4KUbC54jCNjr",
"method": "/controller.servers.services.v1.ServerCoordinationService/Status"
}
},
"datacontentype": "application/cloudevents",
"time": "2023-06-08T14:58:45.112614501Z"
}
In both cases it looks like the request timed out in the middle of running these queries. It is also worth noting that the closeOrphanedConnections query runs after the checkIfNoLongerActive query as part of processing a worker status request, and it runs even if the first query failed. So if the checkIfNoLongerActive query fails due to a timeout, I would expect to also see a timeout error for the closeOrphanedConnections query.
So this points to load issues on the database and/or the controllers. Would you be able to collect some resource utilization metrics for both the database and controllers? Specifically, for the controllers:
And for the database:
It looks like there are three queries that normally run as part of processing the status report, so the ones of particular interest would be queries like:
UPDATE "session_connection" SET "bytes_down"=$1,"bytes_up"=$2,"closed_reason"=$3 WHERE "public_id" = $4
select session_id, state
from session_state ss
where
(ss.state = 'canceling' or ss.state = 'terminated')
and ss.end_time is null

-- Find connections that are not closed so we can reference those IDs
with
unclosed_connections as (
select connection_id
from session_connection_state
where
-- It's the current state
end_time is null
-- Current state isn't closed state
and state in ('authorized', 'connected')
-- It's not in limbo between when it moved into this state and when
-- it started being reported by the worker, which is roughly every
-- 2-3 seconds
and start_time < wt_sub_seconds_from_now(@worker_state_delay_seconds)
),
connections_to_close as (
select public_id
from session_connection
where
-- Related to the worker that just reported to us
worker_id = @worker_id
-- Only unclosed ones
and public_id in (select connection_id from unclosed_connections)
-- These are connection IDs that just got reported to us by the given
-- worker, so they should not be considered closed.
)
update session_connection
set
closed_reason = 'system error'
where
public_id in (select public_id from connections_to_close)
returning public_id;
(This last one looks like one that we looked at as part of #3283)
But it is also possible that some other query is contributing to the load issue, so if you spot any other slow queries, that information would also be helpful.
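If the pg_stat_statements extension happens to be enabled on that Postgres instance, the slowest statements can be listed directly. This is only a sketch against stock PostgreSQL views, not anything Boundary-specific (the column names shown are for PostgreSQL 13+; older versions use total_time and mean_time instead):
-- Top statements by mean execution time (requires the pg_stat_statements extension)
select left(query, 120) as query,
       calls,
       round(mean_exec_time::numeric, 2) as mean_ms,
       round(total_exec_time::numeric, 2) as total_ms,
       rows
from pg_stat_statements
order by mean_exec_time desc
limit 20;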
Hi @tmessi,
Thanks for the thorough explanation. It all makes sense now. Here's what happened since the last time I wrote:
In a desperate attempt to fix the issue I purged old sessions from the session table. That resulted in nearly 150,000 rows being dropped from session_connection 😱. You can see how that affected the database resource consumption below:
(note: the instance was downsized twice in this period, which explains the lower freeable memory)
(note: disregard the peak on Jun 8th, that was intentional)
(note: the Top SQL query is still the same one but generating a much lower load - you can see the previous load levels here)
We hadn't realized the issue was gone until we ran the same pipelines through Boundary again yesterday. Based on what you described above and the symptoms before and after this change, we are confident enough to say that the massive number of rows in session_connection was the culprit.
Now, while the worst has passed, a few things still intrigue me:
Why did we have so many rows in the session_connection table even though the referenced sessions were supposed to have expired a long time ago (~2 months)? Or rather, why were the sessions themselves never dropped? In fact, I have noticed 4 recent sessions that should have been cleaned up but still show a canceling status (it's been like this for a few days already).
Session information:
ID: s_XxQTY48IPy
Scope ID: p_wxthS58rtr
Status: canceling
Created Time: Tue, 06 Jun 2023 11:31:46 CEST
Expiration Time: Sun, 11 Jun 2023 11:31:46 CEST
Updated Time: Thu, 08 Jun 2023 15:09:19 CEST
User ID: u_T5FoMKqLF2
Target ID: ttcp_kN45AtC2Lt
Authorized Actions:
no-op
read
read:self
cancel
cancel:self
ID: s_vnxMjNRXfI
Scope ID: p_6tbahFF27g
Status: canceling
Created Time: Wed, 07 Jun 2023 17:32:43 CEST
Expiration Time: Mon, 12 Jun 2023 17:32:43 CEST
Updated Time: Fri, 09 Jun 2023 11:02:24 CEST
User ID: u_QRrP5hlymF
Authorized Actions:
no-op
read
read:self
cancel
cancel:self
ID: s_bLnLrQyYjZ
Scope ID: p_6tbahFF27g
Status: canceling
Created Time: Wed, 07 Jun 2023 17:58:02 CEST
Expiration Time: Mon, 12 Jun 2023 17:58:02 CEST
Updated Time: Fri, 09 Jun 2023 11:02:24 CEST
User ID: u_QRrP5hlymF
Authorized Actions:
no-op
read
read:self
cancel
cancel:self
ID: s_pD1b6GA9KN
Scope ID: p_6tbahFF27g
Status: canceling
Created Time: Thu, 08 Jun 2023 16:05:53 CEST
Expiration Time: Tue, 13 Jun 2023 16:05:53 CEST
Updated Time: Fri, 09 Jun 2023 11:02:24 CEST
User ID: u_QRrP5hlymF
Authorized Actions:
no-op
read
read:self
cancel
cancel:self
Is it a coincidence that these sessions also seem to be the ones with the most connections?
count | session_id |
---|---|
2435 | s_bLnLrQyYjZ |
2328 | s_NnzbgdEzfn |
2167 | s_XxQTY48IPy |
1962 | s_pD1b6GA9KN |
1534 | s_vnxMjNRXfI |
197 | s_eJKtcbifIu |
152 | s_051rdjAMNL |
151 | s_znu9b68yt0 |
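(For reference, a per-session count like the table above can be produced with something along these lines; the exact query used is not shown here, so this is just a sketch over the session_connection table referenced throughout this thread:)
-- Number of connection rows per session, largest first
select count(*) as count, session_id
from session_connection
group by session_id
order by count desc;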
3 out of the 4 sessions have NULL as target_id and the other has NULL as auth_token_id.
All our connections are terminated with closed_reason unknown, except for a very small number of dangling connections closed by the SQL query that looks for unclosed connections, which have the value system error.
Found a similar closed issue, not sure if that may play a part here.
@irenarindos
That does seem like some sessions are getting stuck in the canceling state, which would prevent those sessions and session connections from automatically getting deleted, since only sessions that are terminated will get deleted.
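As a quick sanity check on how many sessions are sitting in each state, a query along these lines can be run against the database (a sketch, assuming, as in the other queries in this thread, that the session_state row with a NULL end_time is the session's current state):
-- Sessions per current state; the current state is the session_state row whose end_time is still NULL
select ss.state, count(*) as sessions
from session_state as ss
where ss.end_time is null
group by ss.state
order by sessions desc;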
The controller should be running a query periodically to find and terminate any session that is no longer active (for example, one that has expired or been canceled).
Can you check the controller logs for any of the following error messages that might indicate that this query is failing:
error performing termination of completed sessions
error fetching repository for terminating completed sessions
terminating completed sessions ticking shutting down
This last one would be expected during a normal shutdown, but might indicate an issue if it happens on a controller that is still running.
We had 1 occurrence of the following error in the last 7 days:
{
"id": "3tPyXCDpyF",
"source": "https://hashicorp.com/boundary/boundary-controller-1/controller",
"specversion": "1.0",
"type": "error",
"data": {
"error": "session.(Repository).TerminateCompletedSessions: db.DoTx: unknown, unknown: error #0: dbw.Begin: failed to connect to `host=boundary-postgres.internal.[REDACTED].com user=boundary_app database=boundary`: dial error (timeout: dial tcp 10.21.241.165:5432: connect: connection timed out)",
"error_fields": {
"Code": 0,
"Msg": "",
"Op": "session.(Repository).TerminateCompletedSessions",
"Wrapped": {
"Code": 0,
"Msg": "",
"Op": "db.DoTx",
"Wrapped": {}
}
},
"id": "e_Wi0AOmmRrJ",
"version": "v0.1",
"op": "controller.(Controller).startTerminateCompletedSessionsTicking",
"info": {
"msg": "error performing termination of completed sessions"
}
},
"datacontentype": "application/cloudevents",
"time": "2023-06-08T17:35:10.095189613Z"
}
(seems to point to temporary loss of connectivity)
No occurrences of error fetching repository for terminating completed sessions
No occurrences of terminating completed sessions ticking shutting down
Now, looking further into the 3 cases in which session cleanup would not work:
In fact, by running the following query
select
session_id
from
session_connection
where public_id in (
select
connection_id
from
session_connection_state
where
state != 'closed' and
end_time is null
)
we get
session_id |
---|
s_XxQTY48IPy |
s_NnzbgdEzfn |
s_Q1wA3hKiaZ |
s_bLnLrQyYjZ |
s_vnxMjNRXfI |
s_bLnLrQyYjZ |
s_XxQTY48IPy |
s_pD1b6GA9KN |
s_vnxMjNRXfI |
The session IDs mentioned in my last message are shown in bold. So the question now is rather: how can connections still be considered active long after the session was set to expire? Does it really make sense to wait (and for how long)?
And just an afterthought: is the unknown closed_reason something I should worry about? Under what circumstances would that happen? Sorry to ask; I would check the source code myself if I had more time to investigate.
I don't think the unknown reason is related, nor a cause for concern. It seems that is just what is always sent as the reason when the worker reports that the connection is closed. This is probably a separate bug, since I would expect closed by end-user. If you don't mind opening a separate bug for that so it can be more easily tracked, that would be helpful.
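For what it's worth, the distribution of closed_reason values can be checked with a simple aggregate (a sketch over the session_connection column already referenced in this thread):
-- How many connections carry each closed_reason value
select closed_reason, count(*) as connections
from session_connection
group by closed_reason
order by connections desc;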
So the question now is rather: how can connections still be considered active long after the session was set to expire?
That is unexpected. Would you mind running the following query to get some additional data about the sessions in this state?
with
canceling_sessions as (
select s.public_id as session_id, s.expiration_time, s.connection_limit
from session as s
join session_state as ss on s.public_id = ss.session_id
where ss.state = 'canceling'
and ss.end_time is null
),
closed_connections as (
select public_id, session_id, worker_id
from session_connection sc
join session_connection_state scs on sc.public_id = scs.connection_id
where end_time is null
and scs.state = 'closed'
),
open_connections as (
select public_id, session_id, worker_id
from session_connection sc
join session_connection_state scs on sc.public_id = scs.connection_id
where end_time is null
and scs.state != 'closed'
)
select cs.session_id, cs.expiration_time < now() as expired,
cs.connection_limit, count(open.public_id) as open_connections,
count(closed.public_id) as closed_connections,
open.worker_id as open_worker_id,
closed.worker_id as closed_worker_id
from canceling_sessions as cs
left outer join closed_connections as closed on cs.session_id = closed.session_id
left outer join open_connections as open on cs.session_id = open.session_id
group by cs.session_id, cs.expiration_time, cs.connection_limit, closed.worker_id, open.worker_id
order by cs.expiration_time desc;
And for any of the worker_ids returned that have open connections and an expired session, would you be able to check their logs for any mention of the corresponding session IDs?
The query returned
session_id | expired | connection_limit | open_connections | closed_connections | open_worker_id | closed_worker_id |
---|---|---|---|---|---|---|
s_pD1b6GA9KN | TRUE | -1 | 1062 | 1062 | w_pv9puJL6lT | w_sy0swTMhaT |
s_pD1b6GA9KN | TRUE | -1 | 899 | 899 | w_pv9puJL6lT | w_pv9puJL6lT |
s_bLnLrQyYjZ | TRUE | -1 | 864 | 864 | w_sy0swTMhaT | |
s_bLnLrQyYjZ | TRUE | -1 | 4002 | 4002 | w_sy0swTMhaT | w_sy0swTMhaT |
s_vnxMjNRXfI | TRUE | -1 | 1988 | 1988 | w_sy0swTMhaT | |
s_vnxMjNRXfI | TRUE | -1 | 1076 | 1076 | w_sy0swTMhaT | w_sy0swTMhaT |
s_XxQTY48IPy | TRUE | -1 | 718 | 718 | w_R2hr2pbbhx | w_R2hr2pbbhx |
s_XxQTY48IPy | TRUE | -1 | 718 | 718 | w_jfXZ7BVjIM | w_R2hr2pbbhx |
s_XxQTY48IPy | TRUE | -1 | 721 | 721 | w_jfXZ7BVjIM | w_gdEpHsDfNw |
s_XxQTY48IPy | TRUE | -1 | 721 | 721 | w_R2hr2pbbhx | w_gdEpHsDfNw |
s_XxQTY48IPy | TRUE | -1 | 726 | 726 | w_jfXZ7BVjIM | w_jfXZ7BVjIM |
s_XxQTY48IPy | TRUE | -1 | 726 | 726 | w_R2hr2pbbhx | w_jfXZ7BVjIM |
So I checked the logs for s_XxQTY48IPy and s_bLnLrQyYjZ, looking for errors around the same time in the same cluster/namespace, including for other workers (respectively s_XxQTY48IPy-errors.txt and s_bLnLrQyYjZ-errors.txt).
Is it possible that the worker terminated connections due to status grace period expiration, but because it wasn't able to immediately send that information to the controller, the controller and worker states diverged? Would the worker and controller reconcile all active and terminated connections once communication is restored? Is there any sort of acknowledgement when that happens, or is it on a best-effort basis?
Would the worker and controller reconcile all active and terminated connections once the communication is restored? Is there any sort of acknowledgement when that happens or is it on a best-effort basis?
If the worker is able to reconnect, things should reconcile. And if the worker is down for too long, the controller should mark all of its connections as closed, but it appears that something is not reconciling as expected here. Did these workers eventually reconnect? The last thing I see in both logs is an error making status request to controller. You can check with:
select public_id, last_status_time, operational_state
from server_worker;
And those previous results look a little odd; can you try this query as well:
select s.public_id, count(*), sc.worker_id, scs.state
from session as s
join session_connection as sc on s.public_id = sc.session_id
join session_connection_state as scs on sc.public_id = scs.connection_id
where scs.end_time is null
and s.public_id in ('s_XxQTY48IPy', 's_vnxMjNRXfI', 's_bLnLrQyYjZ', 's_pD1b6GA9KN')
group by (s.public_id, sc.worker_id, scs.state);
Did these workers eventually reconnect?
Yes, to my surprise they are actually active. I thought that, because we run workers on Kubernetes, they had been replaced during an upgrade, but they are still active.
public_id | last_status_time | operational_state |
---|---|---|
w_pv9puJL6lT | 2023-06-09 07:49:41.125531+00 | shutdown |
w_zBk6zmPQyN | 2023-04-25 13:15:50.685821+00 | shutdown |
w_Q0NJVW1opt | 2023-04-25 13:15:51.372905+00 | shutdown |
w_Pxx5LR88cz | 2023-05-24 08:43:39.203868+00 | shutdown |
w_r1VICDA7Gc | 2023-06-15 17:49:37.920326+00 | active |
w_iry3DfV7Uk | 2023-06-15 17:49:38.166853+00 | active |
w_sy0swTMhaT | 2023-06-15 17:49:38.472904+00 | active |
w_EEFoTpGPWq | 2023-06-15 17:49:38.918738+00 | active |
w_aH0U367xh8 | 2023-06-15 17:49:39.455784+00 | active |
w_jfXZ7BVjIM | 2023-06-15 17:49:39.724975+00 | active |
w_zoZg8pwrjw | 2023-06-15 17:49:37.556514+00 | active |
w_xXjkQ05Etg | 2023-06-15 17:49:38.438825+00 | active |
w_Abv8I5UcIq | 2023-06-15 17:49:38.972074+00 | active |
w_gdEpHsDfNw | 2023-06-15 17:49:39.678872+00 | active |
w_ZoOsCY8yhM | 2023-05-03 10:42:50.29001+00 | shutdown |
w_U1fdrsXWxU | 2023-06-15 17:49:37.986251+00 | active |
w_Sk8AgZ1dgc | 2023-06-15 17:49:38.38113+00 | active |
w_fVQo4gewAr | 2023-06-15 17:49:39.359341+00 | active |
w_yVFcvg9B2b | 2023-05-03 10:42:55.998853+00 | shutdown |
w_R2hr2pbbhx | 2023-06-15 17:49:38.779715+00 | active |
The second query returns
public_id | count | worker_id | state |
---|---|---|---|
s_bLnLrQyYjZ | 2001 | w_sy0swTMhaT | closed |
s_bLnLrQyYjZ | 2 | w_sy0swTMhaT | connected |
s_bLnLrQyYjZ | 432 | | closed |
s_pD1b6GA9KN | 899 | w_pv9puJL6lT | closed |
s_pD1b6GA9KN | 1 | w_pv9puJL6lT | connected |
s_pD1b6GA9KN | 1062 | w_sy0swTMhaT | closed |
s_vnxMjNRXfI | 538 | w_sy0swTMhaT | closed |
s_vnxMjNRXfI | 2 | w_sy0swTMhaT | connected |
s_vnxMjNRXfI | 994 | | closed |
s_XxQTY48IPy | 721 | w_gdEpHsDfNw | closed |
s_XxQTY48IPy | 726 | w_jfXZ7BVjIM | closed |
s_XxQTY48IPy | 1 | w_jfXZ7BVjIM | connected |
s_XxQTY48IPy | 718 | w_R2hr2pbbhx | closed |
s_XxQTY48IPy | 1 | w_R2hr2pbbhx | connected |
So I thought the connections with state connected could be of special interest:
public_id | worker_id | connection_id | state |
---|---|---|---|
s_bLnLrQyYjZ | w_sy0swTMhaT | sc_8mNjmQZgLA | connected |
s_bLnLrQyYjZ | w_sy0swTMhaT | sc_M119RdYaku | connected |
s_pD1b6GA9KN | w_pv9puJL6lT | sc_Fj3UaEvTtS | connected |
s_vnxMjNRXfI | w_sy0swTMhaT | sc_Cj6zbEvrdw | connected |
s_vnxMjNRXfI | w_sy0swTMhaT | sc_ttyQDy1qCe | connected |
s_XxQTY48IPy | w_jfXZ7BVjIM | sc_tJO4CA3KAi | connected |
s_XxQTY48IPy | w_R2hr2pbbhx | sc_pIbwhteeq4 | connected |
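(The table above was presumably produced with something along these lines; the exact query is not shown, so this is only a sketch:)
-- Connections for those four sessions that are still in the 'connected' state
select sc.session_id, sc.worker_id, sc.public_id as connection_id, scs.state
from session_connection as sc
join session_connection_state as scs on sc.public_id = scs.connection_id
where scs.end_time is null
  and scs.state = 'connected'
  and sc.session_id in ('s_XxQTY48IPy', 's_vnxMjNRXfI', 's_bLnLrQyYjZ', 's_pD1b6GA9KN');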
I also checked the logs for sc_8mNjmQZgLA, as well as the 10 log lines before and after the first time terminated connection due to status grace period expiration is logged.
Hi @tmessi,
Is there anything else I can check that would help you understand this issue? If not, I'm thinking of manually removing those sessions again. So far I haven't noticed any new sessions, besides those 4, that haven't been cleaned up on time.
Hey @macmiranda, sorry for the delay. Tim is out on vaca but we're looking into this. Thanks for your patience ^^
@macmiranda I think I have identified a scenario that could result in session connections being left in the connected state in the database. It would require a specific timing of events in which the worker is unable to report to the controller and then resumes communication. While you were experiencing the issues with #3281, I could see this scenario easily happening. However, I would expect that after resolving those issues this would be a very rare scenario. Over the last few weeks, are you still seeing old connections in the connected state? And over that same time period, have you seen errors with the workers timing out when trying to send status?
Meanwhile, I am still working on trying to recreate this, to see whether the scenario I have in mind is possible, and looking into how best to address it.
Hi @tmessi,
It would require a specific timing of events in which the worker is unable to report to the controller and then resumes communication. While you were experiencing the issues with #3281, I could see this scenario easily happening.
Yes, it seems like that was the issue.
We have not seen any other dangling session connections since my last message. I'll go ahead and remove those 4 expired sessions manually, which should remove the session connections as well.
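In case it helps anyone hitting the same situation, the manual removal could look something like the sketch below. It assumes, as the earlier purge suggested, that deleting from session cascades to session_connection and its state rows, so it is worth wrapping in a transaction and checking the reported row count before committing:
begin;
-- Remove the four sessions that are stuck in the canceling state
delete from session
where public_id in ('s_XxQTY48IPy', 's_vnxMjNRXfI', 's_bLnLrQyYjZ', 's_pD1b6GA9KN');
-- Inspect the reported row count, then commit (or rollback if it looks wrong)
commit;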
As far as the logs go, I wasn't able to check them since we have no logs after upgrading to v0.13.0. Probably some change related to the default logging level... anyway, I will check once we figure that out, but I expect not to see status report errors anymore.
Thanks for your help!
Just to let you know, we are still seeing sendWorkerStatus errors, but it seems the worker is handling it well, i.e. canceling the sessions, which I can see in the database with status canceled.
2023-07-10T01:07:21+02:00 {"id":"kdekmqmHu6","source":"https://hashicorp.com/boundary/boundary-worker-1/worker","specversion":"1.0","type":"error","data":{"error":"status error grace period has expired, canceling all sessions on worker","error_fields":{},"id":"e_eBrnba45R5","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"grace_period":15000000000,"last_status_time":"2023-07-09 23:02:53.542054 +0000 UTC m=+193175.319334184"}},"datacontentype":"application/cloudevents","time":"2023-07-09T23:07:20.900761366Z"}
2023-07-10T01:07:20+02:00 {"id":"9WrT5DNy7E","source":"https://hashicorp.com/boundary/boundary-worker-1/worker","specversion":"1.0","type":"error","data":{"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","error_fields":{},"id":"e_qa075SKWcn","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"msg":"error making status request to controller"}},"datacontentype":"application/cloudevents","time":"2023-07-09T23:03:12.437104648Z"}
2023-07-09T22:07:17+02:00 {"id":"ZD3ShMPy6s","source":"https://hashicorp.com/boundary/boundary-worker-1/worker","specversion":"1.0","type":"error","data":{"error":"status error grace period has expired, canceling all sessions on worker","error_fields":{},"id":"e_7wgYedThQu","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"grace_period":15000000000,"last_status_time":"2023-07-09 20:03:15.961878223 +0000 UTC m=+182397.739158399"}},"datacontentype":"application/cloudevents","time":"2023-07-09T20:04:17.897453496Z"}
2023-07-09T22:04:00+02:00 {"id":"Fcl7dnFa3g","source":"https://hashicorp.com/boundary/boundary-worker-1/worker","specversion":"1.0","type":"error","data":{"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","error_fields":{},"id":"e_pBYHUHlny5","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"msg":"error making status request to controller"}},"datacontentype":"application/cloudevents","time":"2023-07-09T20:03:31.31223578Z"}
Still not sure what may be causing these timeouts but I think we need to do some investigation on our end first.
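If the timeouts do line up with database contention again, one low-effort starting point is PostgreSQL's standard pg_stat_activity view, which shows how long each backend's current statement has been running and whether it is waiting on a lock (a sketch, nothing Boundary-specific):
-- Non-idle backends, longest-running statements first
select pid,
       now() - query_start as runtime,
       state,
       wait_event_type,
       wait_event,
       left(query, 120) as query
from pg_stat_activity
where state <> 'idle'
order by runtime desc nulls last;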
Describe the bug
For now very little is known about the problem: psql sessions break with errors.
The symptom only arises when a certain data pipeline runs 1000s of tests using dbt against a Postgres instance proxied via Boundary. It uses high concurrency (8 parallel jobs) and each job can create multiple session connections because of the way the underlying libpq parallelizes the queries, so connection churn may be a factor.
Another thing we noticed: when that happens, it affects the other sessions on the same worker too, and as far as I can tell, it may affect sessions of users connected to other workers (i.e. in different clusters). It's hard to say for sure, but it seems that the dropped connections are all related (which makes me think that something is affecting the controllers).
Other than that, sessions and connections work just fine.
We're running Boundary version 0.12.2 on both controllers and workers, as Kubernetes StatefulSets.
To Reproduce
At a high level: run dbt with 1000s of tests, with high concurrency, against a single target.
Expected behavior
Stable connections even under heavy load and concurrency.
Additional context
We have manually applied the fixes https://github.com/hashicorp/boundary/pull/3283 and https://github.com/hashicorp/boundary/pull/3280.
I'm attaching logs from the workers where the machine that runs the data pipeline connects (data-workers.txt) and from the controllers and workers running in a different cluster (controllers+workers.txt). We noticed connection breakage on the workers of that cluster as well. The logs are JSON formatted (the extension was changed because GitHub didn't like .json) and span a 2-min interval in which we observed the issue. They are ordered from the most recent to the oldest.
I'd be happy to investigate the issue further and run any debugging commands or gather extra logs.
Thanks.