macmiranda opened this issue 1 year ago (status: Open)
Thanks for the detailed report. Looking at the logs I see a couple of things. In data-workers.txt there are a lot of errors due to rpc error: code = DeadlineExceeded desc = context deadline exceeded when the worker is trying to report status to the controllers, i.e.:
{
"id": "Tmf5FUXdWQ",
"source": "https://hashicorp.com/boundary/boundary-worker-1/worker",
"specversion": "1.0",
"type": "error",
"data": {
"error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded",
"error_fields": {},
"id": "e_H1ZibNYS6F",
"version": "v0.1",
"op": "worker.(Worker).sendWorkerStatus",
"info": {
"msg": "error making status request to controller"
}
},
"datacontentype": "application/cloudevents",
"time": "2023-06-08T14:57:20.529216848Z"
}
If the worker is unable to report to the controller for too long, it will terminate all sessions and connections on the worker. This can also be seen in the logs:
{
"id": "MUxazwTLgj",
"source": "https://hashicorp.com/boundary/boundary-worker-1/worker",
"specversion": "1.0",
"type": "error",
"data": {
"error": "status error grace period has expired, canceling all sessions on worker",
"error_fields": {},
"id": "e_9aQdikjlBe",
"version": "v0.1",
"op": "worker.(Worker).sendWorkerStatus",
"info": {
"grace_period": 15000000000,
"last_status_time": "2023-06-08 14:57:56.406564739 +0000 UTC m=+367.954064626"
}
},
"datacontentype": "application/cloudevents",
"time": "2023-06-08T14:58:17.791085825Z"
}
The rpc error: code = DeadlineExceeded desc = context deadline exceeded error indicates that the status request took too long to get a response. So, looking at the controller logs to see why it is taking so long, I can see errors on the controller while it is processing these status requests. Two in particular stand out:
{
"id": "IrUdOUgOb1",
"source": "https://hashicorp.com/boundary/boundary-controller-0/controller",
"specversion": "1.0",
"type": "error",
"data": {
"error": "session.(Repository).checkIfNotActive: db.Query: unknown, unknown: error #0: context deadline exceeded",
"error_fields": {
"Code": 0,
"Msg": "",
"Op": "session.(Repository).checkIfNotActive",
"Wrapped": {
"Code": 0,
"Msg": "",
"Op": "db.Query",
"Wrapped": {}
}
},
"id": "e_6Ri6sVmgzz",
"version": "v0.1",
"op": "session.(Repository).checkIfNotActive",
"request_info": {
"id": "gtraceid_aiMosV9g4KUbC54jCNjr",
"method": "/controller.servers.services.v1.ServerCoordinationService/Status"
}
},
"datacontentype": "application/cloudevents",
"time": "2023-06-08T14:58:45.112276325Z"
}
and
{
"id": "jdboc37S8o",
"source": "https://hashicorp.com/boundary/boundary-controller-0/controller",
"specversion": "1.0",
"type": "error",
"data": {
"error": "session.(Repository).checkIfNotActive: error checking if sessions are no longer active: db.DoTx: unknown, unknown: error #0: dbw.Rollback: sql: transaction has already been committed or rolled back",
"error_fields": {
"Code": 0,
"Msg": "error checking if sessions are no longer active",
"Op": "session.(Repository).checkIfNotActive",
"Wrapped": {
"Code": 0,
"Msg": "",
"Op": "db.DoTx",
"Wrapped": {}
}
},
"id": "e_6Ri6sVmgzz",
"version": "v0.1",
"op": "session.(Repository).checkIfNotActive",
"request_info": {
"id": "gtraceid_aiMosV9g4KUbC54jCNjr",
"method": "/controller.servers.services.v1.ServerCoordinationService/Status"
}
},
"datacontentype": "application/cloudevents",
"time": "2023-06-08T14:58:45.112614501Z"
}
In both cases it looks like the request timed out in the middle of running these queries. It is also worth noting that the closeOrphanedConnections query runs after the checkIfNoLongerActive query as part of processing a worker status request, and it runs even if the first query failed. So if the checkIfNoLongerActive query fails due to a timeout, I would expect to also see a timeout error for the closeOrphanedConnections query.
So this points to load issues on the database and/or the controllers. Would you be able to collect some resource utilization metrics for both the database and controllers? Specifically, for the controllers:
And for the database:
It looks like there are three queries that normally run as part of processing the status report, so the ones of particular interest would be queries like:
UPDATE "session_connection" SET "bytes_down"=$1,"bytes_up"=$2,"closed_reason"=$3 WHERE "public_id" = $4
select session_id, state
from session_state ss
where
(ss.state = 'canceling' or ss.state = 'terminated')
and ss.end_time is null

-- Find connections that are not closed so we can reference those IDs
with
unclosed_connections as (
select connection_id
from session_connection_state
where
-- It's the current state
end_time is null
-- Current state isn't closed state
and state in ('authorized', 'connected')
-- It's not in limbo between when it moved into this state and when
-- it started being reported by the worker, which is roughly every
-- 2-3 seconds
and start_time < wt_sub_seconds_from_now(@worker_state_delay_seconds)
),
connections_to_close as (
select public_id
from session_connection
where
-- Related to the worker that just reported to us
worker_id = @worker_id
-- Only unclosed ones
and public_id in (select connection_id from unclosed_connections)
-- These are connection IDs that just got reported to us by the given
-- worker, so they should not be considered closed.
)
update session_connection
set
closed_reason = 'system error'
where
public_id in (select public_id from connections_to_close)
returning public_id;
(This last one looks like one that we looked at as part of #3283)
But it is also possible that some other query is contributing to the load issue, so if you spot any other slow queries, that information would also be helpful.
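If the pg_stat_statements extension happens to be enabled on that Postgres instance, the slowest statements can be listed directly. This is only a sketch against stock PostgreSQL views, not anything Boundary-specific (the column names shown are for PostgreSQL 13+; older versions use total_time and mean_time instead):
-- Top statements by mean execution time (requires the pg_stat_statements extension)
select left(query, 120) as query,
       calls,
       round(mean_exec_time::numeric, 2) as mean_ms,
       round(total_exec_time::numeric, 2) as total_ms,
       rows
from pg_stat_statements
order by mean_exec_time desc
limit 20;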
Hi @tmessi,
Thanks for the thorough explanation. It all makes sense now. Here's what happened since the last time I wrote:
In a desperate attempt to fix the issue I purged old sessions from the session table. That resulted in nearly 150,000 rows being dropped from session_connection 😱. You can see how that affected the database resource consumption below:
(note: the instance was downsized twice in this period, which explains the lower freeable memory)
(note: disregard the peak on Jun 8th, that was intentional)
(note: the Top SQL query is still the same one but generating a much lower load - you can see the previous load levels here)
We hadn't realized the issue was gone until we ran the same pipelines through Boundary again yesterday. Based on what you described above and the symptoms before and after this change, we are confident enough to say that the massive number of rows in session_connection was the culprit.
Now, while the worst has passed, a few things still intrigue me:
Why did we have so many rows in the session_connection table even though the referenced sessions were supposed to have expired a long time ago (~2 months)? Or rather, why were the sessions themselves never dropped? In fact, I have noticed 4 recent sessions that should have been cleaned up but still show a canceling status (it's been like this for a few days already).
Session information:
ID: s_XxQTY48IPy
Scope ID: p_wxthS58rtr
Status: canceling
Created Time: Tue, 06 Jun 2023 11:31:46 CEST
Expiration Time: Sun, 11 Jun 2023 11:31:46 CEST
Updated Time: Thu, 08 Jun 2023 15:09:19 CEST
User ID: u_T5FoMKqLF2
Target ID: ttcp_kN45AtC2Lt
Authorized Actions:
no-op
read
read:self
cancel
cancel:self
ID: s_vnxMjNRXfI
Scope ID: p_6tbahFF27g
Status: canceling
Created Time: Wed, 07 Jun 2023 17:32:43 CEST
Expiration Time: Mon, 12 Jun 2023 17:32:43 CEST
Updated Time: Fri, 09 Jun 2023 11:02:24 CEST
User ID: u_QRrP5hlymF
Authorized Actions:
no-op
read
read:self
cancel
cancel:self
ID: s_bLnLrQyYjZ
Scope ID: p_6tbahFF27g
Status: canceling
Created Time: Wed, 07 Jun 2023 17:58:02 CEST
Expiration Time: Mon, 12 Jun 2023 17:58:02 CEST
Updated Time: Fri, 09 Jun 2023 11:02:24 CEST
User ID: u_QRrP5hlymF
Authorized Actions:
no-op
read
read:self
cancel
cancel:self
ID: s_pD1b6GA9KN
Scope ID: p_6tbahFF27g
Status: canceling
Created Time: Thu, 08 Jun 2023 16:05:53 CEST
Expiration Time: Tue, 13 Jun 2023 16:05:53 CEST
Updated Time: Fri, 09 Jun 2023 11:02:24 CEST
User ID: u_QRrP5hlymF
Authorized Actions:
no-op
read
read:self
cancel
cancel:self
Is it a coincidence that these sessions also seem to be the ones with the most connections?
count | session_id |
---|---|
2435 | s_bLnLrQyYjZ |
2328 | s_NnzbgdEzfn |
2167 | s_XxQTY48IPy |
1962 | s_pD1b6GA9KN |
1534 | s_vnxMjNRXfI |
197 | s_eJKtcbifIu |
152 | s_051rdjAMNL |
151 | s_znu9b68yt0 |
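(For reference, a per-session count like the table above can be produced with something along these lines; the exact query used is not shown here, so this is just a sketch over the session_connection table referenced throughout this thread:)
-- Number of connection rows per session, largest first
select count(*) as count, session_id
from session_connection
group by session_id
order by count desc;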
3 out of the 4 sessions have NULL as target_id and the other has NULL as auth_token_id.
All our connections are terminated with closed_reason unknown, except for a very small number of dangling connections closed by the SQL query that looks for unclosed connections, which have the value system error.
Found a similar closed issue, not sure if that may play a part here.
@irenarindos
That does seem like some sessions are getting stuck in the canceling state, which would prevent those sessions and session connections from automatically getting deleted, since only sessions that are terminated will get deleted.
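As a quick sanity check on how many sessions are sitting in each state, a query along these lines can be run against the database (a sketch, assuming, as in the other queries in this thread, that the session_state row with a NULL end_time is the session's current state):
-- Sessions per current state; the current state is the session_state row whose end_time is still NULL
select ss.state, count(*) as sessions
from session_state as ss
where ss.end_time is null
group by ss.state
order by sessions desc;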
The controller should be running a query periodically to find and terminate any session that is no longer active (for example, one that has expired or been canceled).
Can you check the controller logs for any of the following error messages that might indicate that this query is failing:
error performing termination of completed sessions
error fetching repository for terminating completed sessions
terminating completed sessions ticking shutting down
This last one would be expected during a normal shutdown, but might indicate an issue if it happens on a controller that is still running.
We had 1 occurrence of the following error in the last 7 days:
{
"id": "3tPyXCDpyF",
"source": "https://hashicorp.com/boundary/boundary-controller-1/controller",
"specversion": "1.0",
"type": "error",
"data": {
"error": "session.(Repository).TerminateCompletedSessions: db.DoTx: unknown, unknown: error #0: dbw.Begin: failed to connect to `host=boundary-postgres.internal.[REDACTED].com user=boundary_app database=boundary`: dial error (timeout: dial tcp 10.21.241.165:5432: connect: connection timed out)",
"error_fields": {
"Code": 0,
"Msg": "",
"Op": "session.(Repository).TerminateCompletedSessions",
"Wrapped": {
"Code": 0,
"Msg": "",
"Op": "db.DoTx",
"Wrapped": {}
}
},
"id": "e_Wi0AOmmRrJ",
"version": "v0.1",
"op": "controller.(Controller).startTerminateCompletedSessionsTicking",
"info": {
"msg": "error performing termination of completed sessions"
}
},
"datacontentype": "application/cloudevents",
"time": "2023-06-08T17:35:10.095189613Z"
}
(seems to point to temporary loss of connectivity)
No occurrences of error fetching repository for terminating completed sessions
No occurrences of terminating completed sessions ticking shutting down
Now, looking further into the 3 cases in which session cleanup would not work:
In fact, by running the following query
select
session_id
from
session_connection
where public_id in (
select
connection_id
from
session_connection_state
where
state != 'closed' and
end_time is null
)
we get
session_id |
---|
s_XxQTY48IPy |
s_NnzbgdEzfn |
s_Q1wA3hKiaZ |
s_bLnLrQyYjZ |
s_vnxMjNRXfI |
s_bLnLrQyYjZ |
s_XxQTY48IPy |
s_pD1b6GA9KN |
s_vnxMjNRXfI |
The session IDs mentioned in my last message are shown in bold. So the question now is rather: how can connections still be considered active long after the session was set to expire? Does it really make sense to wait (and for how long)?
And just an afterthought: is the unknown closed_reason something I should worry about? Under what circumstances would that happen? Sorry to ask; I would check the source code myself if I had more time to investigate.
I don't think the unknown reason is related, nor a cause for concern. It seems that is just what is always sent as the reason when the worker reports that the connection is closed. This is probably a separate bug, since I would expect closed by end-user. If you don't mind opening a separate bug for that so it can be more easily tracked, that would be helpful.
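For what it's worth, the distribution of closed_reason values can be checked with a simple aggregate (a sketch over the session_connection column already referenced in this thread):
-- How many connections carry each closed_reason value
select closed_reason, count(*) as connections
from session_connection
group by closed_reason
order by connections desc;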
So the question now is rather: how can connections still be considered active long after the session was set to expire?
That is unexpected. Would you mind running the following query to get some additional data about the sessions in this state?
with
canceling_sessions as (
select s.public_id as session_id, s.expiration_time, s.connection_limit
from session as s
join session_state as ss on s.public_id = ss.session_id
where ss.state = 'canceling'
and ss.end_time is null
),
closed_connections as (
select public_id, session_id, worker_id
from session_connection sc
join session_connection_state scs on sc.public_id = scs.connection_id
where end_time is null
and scs.state = 'closed'
),
open_connections as (
select public_id, session_id, worker_id
from session_connection sc
join session_connection_state scs on sc.public_id = scs.connection_id
where end_time is null
and scs.state != 'closed'
)
select cs.session_id, cs.expiration_time < now() as expired,
cs.connection_limit, count(open.public_id) as open_connections,
count(closed.public_id) as closed_connections,
open.worker_id as open_worker_id,
closed.worker_id as closed_worker_id
from canceling_sessions as cs
left outer join closed_connections as closed on cs.session_id = closed.session_id
left outer join open_connections as open on cs.session_id = open.session_id
group by cs.session_id, cs.expiration_time, cs.connection_limit, closed.worker_id, open.worker_id
order by cs.expiration_time desc;
And for any of the worker_ids returned that have open connections and an expired session, would you be able to check their logs for any mention of the corresponding session IDs?
The query returned
session_id | expired | connection_limit | open_connections | closed_connections | open_worker_id | closed_worker_id |
---|---|---|---|---|---|---|
s_pD1b6GA9KN | TRUE | -1 | 1062 | 1062 | w_pv9puJL6lT | w_sy0swTMhaT |
s_pD1b6GA9KN | TRUE | -1 | 899 | 899 | w_pv9puJL6lT | w_pv9puJL6lT |
s_bLnLrQyYjZ | TRUE | -1 | 864 | 864 | w_sy0swTMhaT | |
s_bLnLrQyYjZ | TRUE | -1 | 4002 | 4002 | w_sy0swTMhaT | w_sy0swTMhaT |
s_vnxMjNRXfI | TRUE | -1 | 1988 | 1988 | w_sy0swTMhaT | |
s_vnxMjNRXfI | TRUE | -1 | 1076 | 1076 | w_sy0swTMhaT | w_sy0swTMhaT |
s_XxQTY48IPy | TRUE | -1 | 718 | 718 | w_R2hr2pbbhx | w_R2hr2pbbhx |
s_XxQTY48IPy | TRUE | -1 | 718 | 718 | w_jfXZ7BVjIM | w_R2hr2pbbhx |
s_XxQTY48IPy | TRUE | -1 | 721 | 721 | w_jfXZ7BVjIM | w_gdEpHsDfNw |
s_XxQTY48IPy | TRUE | -1 | 721 | 721 | w_R2hr2pbbhx | w_gdEpHsDfNw |
s_XxQTY48IPy | TRUE | -1 | 726 | 726 | w_jfXZ7BVjIM | w_jfXZ7BVjIM |
s_XxQTY48IPy | TRUE | -1 | 726 | 726 | w_R2hr2pbbhx | w_jfXZ7BVjIM |
So I checked the logs for s_XxQTY48IPy and s_bLnLrQyYjZ, looking for errors around the same time in the same cluster/namespace, including for other workers (respectively s_XxQTY48IPy-errors.txt and s_bLnLrQyYjZ-errors.txt).
Is it possible that the worker terminated connections due to status grace period expiration, but because it wasn't able to immediately send that information to the controller, the controller and worker states diverged? Would the worker and controller reconcile all active and terminated connections once communication is restored? Is there any sort of acknowledgement when that happens, or is it on a best-effort basis?
Would the worker and controller reconcile all active and terminated connections once the communication is restored? Is there any sort of acknowledgement when that happens or is it on a best-effort basis?
If the worker is able to reconnect, things should reconcile. And if the worker is down for too long, the controller should mark all of its connections as closed, but it appears that something is not reconciling as expected here. Did these workers eventually reconnect? The last thing I see in both logs is an error making status request to controller. You can check with:
select public_id, last_status_time, operational_state
from server_worker;
And those previous results look a little odd; can you try this query as well:
select s.public_id, count(*), sc.worker_id, scs.state
from session as s
join session_connection as sc on s.public_id = sc.session_id
join session_connection_state as scs on sc.public_id = scs.connection_id
where scs.end_time is null
and s.public_id in ('s_XxQTY48IPy', 's_vnxMjNRXfI', 's_bLnLrQyYjZ', 's_pD1b6GA9KN')
group by (s.public_id, sc.worker_id, scs.state);
Did these workers eventually reconnect?
Yes, to my surprise they are actually active. I thought that, because we run workers on Kubernetes, they had been replaced during an upgrade, but they are still active.
public_id | last_status_time | operational_state |
---|---|---|
w_pv9puJL6lT | 2023-06-09 07:49:41.125531+00 | shutdown |
w_zBk6zmPQyN | 2023-04-25 13:15:50.685821+00 | shutdown |
w_Q0NJVW1opt | 2023-04-25 13:15:51.372905+00 | shutdown |
w_Pxx5LR88cz | 2023-05-24 08:43:39.203868+00 | shutdown |
w_r1VICDA7Gc | 2023-06-15 17:49:37.920326+00 | active |
w_iry3DfV7Uk | 2023-06-15 17:49:38.166853+00 | active |
w_sy0swTMhaT | 2023-06-15 17:49:38.472904+00 | active |
w_EEFoTpGPWq | 2023-06-15 17:49:38.918738+00 | active |
w_aH0U367xh8 | 2023-06-15 17:49:39.455784+00 | active |
w_jfXZ7BVjIM | 2023-06-15 17:49:39.724975+00 | active |
w_zoZg8pwrjw | 2023-06-15 17:49:37.556514+00 | active |
w_xXjkQ05Etg | 2023-06-15 17:49:38.438825+00 | active |
w_Abv8I5UcIq | 2023-06-15 17:49:38.972074+00 | active |
w_gdEpHsDfNw | 2023-06-15 17:49:39.678872+00 | active |
w_ZoOsCY8yhM | 2023-05-03 10:42:50.29001+00 | shutdown |
w_U1fdrsXWxU | 2023-06-15 17:49:37.986251+00 | active |
w_Sk8AgZ1dgc | 2023-06-15 17:49:38.38113+00 | active |
w_fVQo4gewAr | 2023-06-15 17:49:39.359341+00 | active |
w_yVFcvg9B2b | 2023-05-03 10:42:55.998853+00 | shutdown |
w_R2hr2pbbhx | 2023-06-15 17:49:38.779715+00 | active |
The second query returns
public_id | count | worker_id | state |
---|---|---|---|
s_bLnLrQyYjZ | 2001 | w_sy0swTMhaT | closed |
s_bLnLrQyYjZ | 2 | w_sy0swTMhaT | connected |
s_bLnLrQyYjZ | 432 | | closed |
s_pD1b6GA9KN | 899 | w_pv9puJL6lT | closed |
s_pD1b6GA9KN | 1 | w_pv9puJL6lT | connected |
s_pD1b6GA9KN | 1062 | w_sy0swTMhaT | closed |
s_vnxMjNRXfI | 538 | w_sy0swTMhaT | closed |
s_vnxMjNRXfI | 2 | w_sy0swTMhaT | connected |
s_vnxMjNRXfI | 994 | | closed |
s_XxQTY48IPy | 721 | w_gdEpHsDfNw | closed |
s_XxQTY48IPy | 726 | w_jfXZ7BVjIM | closed |
s_XxQTY48IPy | 1 | w_jfXZ7BVjIM | connected |
s_XxQTY48IPy | 718 | w_R2hr2pbbhx | closed |
s_XxQTY48IPy | 1 | w_R2hr2pbbhx | connected |
So I thought the connections with state connected could be of special interest:
public_id | worker_id | connection_id | state |
---|---|---|---|
s_bLnLrQyYjZ | w_sy0swTMhaT | sc_8mNjmQZgLA | connected |
s_bLnLrQyYjZ | w_sy0swTMhaT | sc_M119RdYaku | connected |
s_pD1b6GA9KN | w_pv9puJL6lT | sc_Fj3UaEvTtS | connected |
s_vnxMjNRXfI | w_sy0swTMhaT | sc_Cj6zbEvrdw | connected |
s_vnxMjNRXfI | w_sy0swTMhaT | sc_ttyQDy1qCe | connected |
s_XxQTY48IPy | w_jfXZ7BVjIM | sc_tJO4CA3KAi | connected |
s_XxQTY48IPy | w_R2hr2pbbhx | sc_pIbwhteeq4 | connected |
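(The table above was presumably produced with something along these lines; the exact query is not shown, so this is only a sketch:)
-- Connections for those four sessions that are still in the 'connected' state
select sc.session_id, sc.worker_id, sc.public_id as connection_id, scs.state
from session_connection as sc
join session_connection_state as scs on sc.public_id = scs.connection_id
where scs.end_time is null
  and scs.state = 'connected'
  and sc.session_id in ('s_XxQTY48IPy', 's_vnxMjNRXfI', 's_bLnLrQyYjZ', 's_pD1b6GA9KN');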
I also checked the logs for sc_8mNjmQZgLA, as well as the 10 log lines before and after the first time terminated connection due to status grace period expiration is logged.
Hi @tmessi,
Is there anything else I can check that would help you understand this issue? If not, I'm thinking of manually removing those sessions again. So far I haven't noticed any new sessions, besides those 4, that haven't been cleaned up on time.
Hey @macmiranda, sorry for the delay. Tim is out on vaca but we're looking into this. Thanks for your patience ^^
@macmiranda I think I have identified a scenario that could result in session connections being left in the connected state in the database. It would require a specific timing of events in which the worker is unable to report to the controller and then resumes communication. While you were experiencing the issues with #3281, I could see this scenario easily happening. However, I would expect that after resolving those issues this would be a very rare scenario. Over the last few weeks, are you still seeing old connections in the connected state? And over that same time period, have you seen errors with the workers timing out when trying to send status?
Meanwhile, I am still working on trying to recreate this, to see whether the scenario I have in mind is possible, and looking into how best to address it.
Hi @tmessi,
It would require a specific timing of events in which the worker is unable to report to the controller and then resumes communication. While you were experiencing the issues with #3281, I could see this scenario easily happening.
Yes, it seems like that was the issue.
We have not seen any other dangling session connections since my last message. I'll go ahead and remove those 4 expired sessions manually, which should remove the session connections as well.
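In case it helps anyone hitting the same situation, the manual removal could look something like the sketch below. It assumes, as the earlier purge suggested, that deleting from session cascades to session_connection and its state rows, so it is worth wrapping in a transaction and checking the reported row count before committing:
begin;
-- Remove the four sessions that are stuck in the canceling state
delete from session
where public_id in ('s_XxQTY48IPy', 's_vnxMjNRXfI', 's_bLnLrQyYjZ', 's_pD1b6GA9KN');
-- Inspect the reported row count, then commit (or rollback if it looks wrong)
commit;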
As far as the logs go, I wasn't able to check them since we have no logs after upgrading to v0.13.0. Probably some change related to the default logging level... anyway, I will check once we figure that out, but I expect not to see status report errors anymore.
Thanks for your help!
Just to let you know, we are still seeing sendWorkerStatus errors, but it seems the worker is handling it well, i.e. canceling the sessions, which I can see in the database with status canceled.
2023-07-10T01:07:21+02:00 {"id":"kdekmqmHu6","source":"https://hashicorp.com/boundary/boundary-worker-1/worker","specversion":"1.0","type":"error","data":{"error":"status error grace period has expired, canceling all sessions on worker","error_fields":{},"id":"e_eBrnba45R5","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"grace_period":15000000000,"last_status_time":"2023-07-09 23:02:53.542054 +0000 UTC m=+193175.319334184"}},"datacontentype":"application/cloudevents","time":"2023-07-09T23:07:20.900761366Z"}
2023-07-10T01:07:20+02:00 {"id":"9WrT5DNy7E","source":"https://hashicorp.com/boundary/boundary-worker-1/worker","specversion":"1.0","type":"error","data":{"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","error_fields":{},"id":"e_qa075SKWcn","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"msg":"error making status request to controller"}},"datacontentype":"application/cloudevents","time":"2023-07-09T23:03:12.437104648Z"}
2023-07-09T22:07:17+02:00 {"id":"ZD3ShMPy6s","source":"https://hashicorp.com/boundary/boundary-worker-1/worker","specversion":"1.0","type":"error","data":{"error":"status error grace period has expired, canceling all sessions on worker","error_fields":{},"id":"e_7wgYedThQu","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"grace_period":15000000000,"last_status_time":"2023-07-09 20:03:15.961878223 +0000 UTC m=+182397.739158399"}},"datacontentype":"application/cloudevents","time":"2023-07-09T20:04:17.897453496Z"}
2023-07-09T22:04:00+02:00 {"id":"Fcl7dnFa3g","source":"https://hashicorp.com/boundary/boundary-worker-1/worker","specversion":"1.0","type":"error","data":{"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","error_fields":{},"id":"e_pBYHUHlny5","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"msg":"error making status request to controller"}},"datacontentype":"application/cloudevents","time":"2023-07-09T20:03:31.31223578Z"}
Still not sure what may be causing these timeouts but I think we need to do some investigation on our end first.
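If the timeouts do line up with database contention again, one low-effort starting point is PostgreSQL's standard pg_stat_activity view, which shows how long each backend's current statement has been running and whether it is waiting on a lock (a sketch, nothing Boundary-specific):
-- Non-idle backends, longest-running statements first
select pid,
       now() - query_start as runtime,
       state,
       wait_event_type,
       wait_event,
       left(query, 120) as query
from pg_stat_activity
where state <> 'idle'
order by runtime desc nulls last;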
Describe the bug
For now very little is known about the problem: psql sessions break with errors.
The symptom only arises when a certain data pipeline runs 1000s of tests using dbt against a Postgres instance proxied via Boundary. It uses high concurrency (8 parallel jobs) and each job can create multiple session connections because of the way the underlying libpq parallelizes the queries, so connection churn may be a factor.
Another thing we noticed: when that happens, it affects the other sessions on the same worker too, and as far as I can tell, it may affect sessions of users connected to other workers (i.e. in different clusters). It's hard to say for sure, but it seems that the dropped connections are all related (which makes me think that something is affecting the controllers).
Other than that, sessions and connections work just fine.
We're running Boundary version 0.12.2 on both controllers and workers, as Kubernetes StatefulSets.
To Reproduce
At a high level: run dbt with 1000s of tests, with high concurrency, against a single target.
Expected behavior
Stable connections even under heavy load and concurrency.
Additional context
We have manually applied the fixes https://github.com/hashicorp/boundary/pull/3283 and https://github.com/hashicorp/boundary/pull/3280.
I'm attaching logs from the workers where the machine that runs the data pipeline connects (data-workers.txt) and from the controllers and workers running in a different cluster (controllers+workers.txt). We noticed connection breakage on the workers of that cluster as well. The logs are JSON formatted (the extension was changed because GitHub didn't like .json) and span a 2-min interval in which we observed the issue. They are ordered from the most recent to the oldest.
I'd be happy to investigate the issue further and run any debugging commands or gather extra logs.
Thanks.