Node 18 master pod restart fix

sotojn commented 11 months ago

I have fixed a bug that happens on node 18 that is addressed here #3457

The issue was that the execution client has a variable called serverShutdown that gets set to true when the masterpod is told to shutdown. But when the master pod is booted back up and reconnects with the client, the client can no longer send execution analytics to the master pod causing the following error.

[2023-11-27T17:49:38.468Z] ERROR: teraslice/10 on ts-exc-kafka-to-es-e052b800-fee7-6vwqv: Client ClusterMaster is not ready (assignment=execution_controller, module=messaging:client, worker_id=_Q15YZQ9, ex_id=c3c188b1-4884-4867-bf86-c126f7b24034, job_id=e052b800-fee7-4af6-994b-3f5a4f78fd31)
    Error: Client ClusterMaster is not ready
        at Client.waitForClientReady (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:99:19)
        at runNextTicks (node:internal/process/task_queues:60:5)
        at process.processTimers (node:internal/timers:509:9)
        at async Socket.<anonymous> (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:74:21)
[2023-11-27T17:49:39.815Z]  WARN: teraslice/10 on ts-exc-kafka-to-es-e052b800-fee7-6vwqv: cluster master did not record the cluster analytics (assignment=execution_controller, module=execution_analytics, worker_id=_Q15YZQ9, ex_id=c3c188b1-4884-4867-bf86-c126f7b24034, job_id=e052b800-fee7-4af6-994b-3f5a4f78fd31)

This also causes a timeout error on the master pod server:

[2023-11-08T17:44:47.677Z]  INFO: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: execution 8d30d3e0-f1de-4bd8-a040-a276ff75f98f is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=5PZc0y47)
[2023-11-08T17:44:47.703Z]  INFO: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: execution 28906665-2a5d-4da5-8808-287fc70309dd is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=5PZc0y47)
[2023-11-08T17:45:03.577Z]  INFO: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: execution ebb1fc42-e9e3-4631-adba-158bbe35b11a is connected (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=5PZc0y47)
[2023-11-08T17:47:33.024Z] ERROR: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: Timed out after 2m, waiting for message "execution:analytics" (assignment=cluster_master, module=api_service, worker_id=5PZc0y47)
    Error: Timed out after 2m, waiting for message "execution:analytics"
        at Server.handleSendResponse (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:43:19)
        at runNextTicks (node:internal/process/task_queues:60:5)
        at process.processTimers (node:internal/timers:509:9)
        at async Promise.all (index 0)
        at async Object.getControllerStats (/app/source/packages/teraslice/lib/cluster/services/execution.js:264:25)
        at async /app/source/packages/teraslice/lib/utils/api_utils.js:54:28
[2023-11-08T17:47:34.017Z] ERROR: teraslice/20 on teraslice-qa-master-6bccc5dfc8-mn5kz: Timed out after 2m, waiting for message "execution:analytics" (assignment=cluster_master, module=api_service, worker_id=5PZc0y47)
    Error: Timed out after 2m, waiting for message "execution:analytics"
        at Server.handleSendResponse (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:43:19)
        at runNextTicks (node:internal/process/task_queues:60:5)
        at process.processTimers (node:internal/timers:509:9)
        at async Promise.all (index 0)
        at async Object.getControllerStats (/app/source/packages/teraslice/lib/cluster/services/execution.js:264:25)
        at async /app/source/packages/teraslice/lib/utils/api_utils.js:54:28

This PR now sets serverShutdown back to false when the execution client gets reconnected with the master pod fixing the issue.

godber commented 11 months ago

Since there is no test coverage for this, does your manual testing show controller analytics continuing to work /txt/controllers on running jobs after a master restart?

sotojn commented 11 months ago

Since there is no test coverage for this, does your manual testing show controller analytics continuing to work /txt/controllers on running jobs after a master restart?

I have curled the txt/controller endpoints before and after the master pod shutdown and it works. I also had ts-top running in the background which did update after the restart

godber commented 11 months ago

I've manually confirmed that these changes allow the Teraslice master to be restarted without causing the messaging system to break. I am able to still hit /txt/controllers and there are no errors in the logs.

Here's some hastily gathered notes on my steps:

yarn k8s
earl assets deploy local --bundle terascope/elasticsearch-assets
earl assets deploy local --bundle terascope/standard-assets
earl tjm register local examples/jobs/data_generator.json
earl tjm start examples/jobs/data_generator.json

kubectl get namespaces | grep dev1
services-dev1        Active   12m
ts-dev1              Active   5m12s

kubectl -n ts-dev1 get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
teraslice-master-84d4c87c7b-9vhqr                      1/1     Running   0          5m41s
ts-exc-data-generator-bce93c1e-d1db-lhqln              1/1     Running   0          2m16s
ts-wkr-data-generator-bce93c1e-d1db-5d9d8f7bb6-qczz5   1/1     Running   0          2m14s

kubectl -n ts-dev1 logs -f teraslice-master-84d4c87c7b-9vhqr | bunyan

kubectl -n ts-dev1 delete pod teraslice-master-84d4c87c7b-9vhqr
pod "teraslice-master-84d4c87c7b-9vhqr" deleted

# doesnt return
curl localhost:5678/txt/controllers

And we see the errors

[2023-11-28T00:43:01.792Z] ERROR: teraslice/17 on teraslice-master-84d4c87c7b-n8pb9: Timed out after 2m, waiting for message "execution:analytics" (assignment=cluster_master, module=api_service, worker_id=pYaRZ5mM)
    Error: Timed out after 2m, waiting for message "execution:analytics"
        at Server.handleSendResponse (/app/source/packages/teraslice-messaging/dist/src/messenger/core.js:43:19)
        at async Promise.all (index 0)
        at async Object.getControllerStats (/app/source/packages/teraslice/lib/cluster/services/execution.js:264:25)
        at async /app/source/packages/teraslice/lib/cluster/services/api.js:418:27
        at async /app/source/packages/teraslice/lib/utils/api_utils.js:54:28

Teardown

kind delete cluster -n k8se2e
git checkout examples/jobs/data_generator.json

Repeat with the modified code, and it works.

sotojn commented 11 months ago

Please bump the patch level version on the teraslice package and I will merge this.

All the following packages have been bumped:

[8667660] Teraslice from v0.87.0 to v0.87.1
[5fa9368] @terascope/teraslice-messaging from v0.34.0 to v0.34.1

terascope / teraslice

Node 18 master pod restart fix #3477