Closed: sotojn closed this pull request 11 months ago
Since there is no test coverage for this, does your manual testing show controller analytics (/txt/controllers) continuing to work on running jobs after a master restart?
I have curled the /txt/controllers endpoint before and after the master pod shutdown and it works. I also had ts-top running in the background, which did update after the restart.
I've manually confirmed that these changes allow the Teraslice master to be restarted without breaking the messaging system. I am still able to hit /txt/controllers and there are no errors in the logs.
Here are some hastily gathered notes on my steps:
yarn k8s
earl assets deploy local --bundle terascope/elasticsearch-assets
earl assets deploy local --bundle terascope/standard-assets
earl tjm register local examples/jobs/data_generator.json
earl tjm start examples/jobs/data_generator.json
kubectl get namespaces | grep dev1
services-dev1 Active 12m
ts-dev1 Active 5m12s
kubectl -n ts-dev1 get pods
NAME READY STATUS RESTARTS AGE
teraslice-master-84d4c87c7b-9vhqr 1/1 Running 0 5m41s
ts-exc-data-generator-bce93c1e-d1db-lhqln 1/1 Running 0 2m16s
ts-wkr-data-generator-bce93c1e-d1db-5d9d8f7bb6-qczz5 1/1 Running 0 2m14s
kubectl -n ts-dev1 logs -f teraslice-master-84d4c87c7b-9vhqr | bunyan
kubectl -n ts-dev1 delete pod teraslice-master-84d4c87c7b-9vhqr
pod "teraslice-master-84d4c87c7b-9vhqr" deleted
# doesn't return
curl localhost:5678/txt/controllers
And we see the errors:
[2023-11-28T00:43:01.792Z] ERROR: teraslice/17 on teraslice-master-84d4c87c7b-n8pb9: Timed out after 2m, waiting for message "execution:analytics" (assignment=cluster_master, module=api_service, worker_id=pYaRZ5mM)
Error: Timed out after 2m, waiting for message "execution:analytics"
at Server.handleSendResponse (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:43:19)
at async Promise.all (index 0)
at async Object.getControllerStats (/app/source/packages/teraslice/lib/cluster/services/execution.js:264:25)
at async /app/source/packages/teraslice/lib/cluster/services/api.js:418:27
at async /app/source/packages/teraslice/lib/utils/api_utils.js:54:28
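(Aside: the 2-minute timeout above is the standard pattern of racing a pending reply against a timer. The sketch below is only my own generic illustration of that pattern, not the teraslice-messaging code; waitForMessage and its parameters are made up for illustration.)

```typescript
// Generic illustration of the "wait for a reply or time out" pattern,
// NOT the teraslice-messaging implementation.
function waitForMessage<T>(
    pending: Promise<T>,
    eventName: string,
    timeoutMs: number
): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_resolve, reject) => {
        timer = setTimeout(() => {
            reject(new Error(
                `Timed out after ${timeoutMs}ms, waiting for message "${eventName}"`
            ));
        }, timeoutMs);
    });
    // Whichever settles first wins; always clear the pending timer.
    return Promise.race([pending, timeout]).finally(() => {
        if (timer !== undefined) clearTimeout(timer);
    });
}
```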
Teardown
kind delete cluster -n k8se2e
git checkout examples/jobs/data_generator.json
Repeat with the modified code, and it works.
Please bump the patch level version on the teraslice package and I will merge this.
All the following packages have been bumped:
v0.87.0 to v0.87.1
v0.34.0 to v0.34.1
I have also fixed a bug that happens on Node 18, which is addressed here: #3457
The issue was that the execution client has a variable called serverShutdown that gets set to true when the master pod is told to shut down. But when the master pod is booted back up and reconnects with the client, the client can no longer send execution analytics to the master pod, which in turn causes the "Timed out after 2m, waiting for message \"execution:analytics\"" error shown above on the master pod server.
This PR now sets serverShutdown back to false when the execution client reconnects with the master pod, fixing the issue.
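To make the mechanism concrete, here is a minimal sketch of the idea, assuming a socket-style client. This is my own illustration, not the actual Teraslice or teraslice-messaging source; the class, method, and event names are made up.

```typescript
// Minimal sketch of the idea behind the fix, NOT the actual Teraslice source.
// Assumes the execution client keeps a serverShutdown flag that blocks
// outgoing analytics once the master announces it is shutting down, and that
// the underlying socket emits shutdown- and reconnect-style events
// (event names here are illustrative only).
interface MessengerSocket {
    on(event: string, handler: () => void): void;
}

class ExecutionClientSketch {
    private serverShutdown = false;

    constructor(private socket: MessengerSocket) {
        // Master pod announces shutdown: stop sending execution analytics.
        this.socket.on('server:shutdown', () => {
            this.serverShutdown = true;
        });

        // The fix: when the client reconnects to the restarted master pod,
        // clear the flag so execution analytics can be sent again.
        this.socket.on('reconnect', () => {
            this.serverShutdown = false;
        });
    }

    // Guards calls like sendExecutionAnalytics() elsewhere in the client.
    canSendToServer(): boolean {
        return !this.serverShutdown;
    }
}
```

The key point is that the shutdown state belongs to a particular server connection, so it has to be reset whenever a new connection to the restarted master is established.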