terascope / teraslice

Scalable data processing pipelines in JavaScript
https://terascope.github.io/teraslice/
Apache License 2.0
50 stars 13 forks source link

[teraslice] Fix ex controller non-zero exit code bug in k8s backend #3756

Closed sotojn closed 2 months ago

sotojn commented 2 months ago

This PR makes the following changes:

Bug Fixes

3755

sotojn commented 2 months ago

Ran a job that blows up with a JSON parsing error after one minute. This is in kubernetesV2 using the kafka reader. Here are the results.

Running curl localhost:5678

{
    "arch": "arm64",
    "clustering_type": "kubernetesV2",
    "name": "ts-dev1",
    "node_version": "v18.20.4",
    "platform": "linux",
    "teraslice_version": "v2.3.0"
}

Execution logs when it dies the first time (Exits with exit code 1):

[2024-09-17T16:18:06.118Z] DEBUG: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: dispatched slice f76f0174-0431-4036-bd85-86e20a6f4361 to worker 10.244.0.10__ZZM7zWA7 (assignment=execution_controller, module=execution_controller, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:27.921Z] ERROR: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: execution_controller received an uncaughtException, exiting in 30s... (assignment=execution_controller, module=execution_controller:shutdown_handler, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
    Error: Unable to parse JSON: Unexpected token n in JSON at position 2
        at Timeout._onTimeout (file:///app/assets/8a13b6aabe2c651a21008d37f487057a754c5648/index.js:22948:15)
        at listOnTimeout (node:internal/timers:569:17)
        at process.processTimers (node:internal/timers:512:7)
[2024-09-17T16:18:28.022Z] DEBUG: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: execution shutdown was called for ex 4650a336-107b-45cc-b74f-00025f43b486 (assignment=execution_controller, module=execution_controller, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.023Z] DEBUG: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: stopping scheduler... (assignment=execution_controller, module=execution_scheduler, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.023Z] DEBUG: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: execution 4650a336-107b-45cc-b74f-00025f43b486 is finished scheduling, 3 remaining slices in the queue (assignment=execution_controller, module=execution_scheduler, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.024Z] DEBUG: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: shutdown is waiting for execution to finish... (assignment=execution_controller, module=execution_controller, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.028Z] DEBUG: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: execution 4650a336-107b-45cc-b74f-00025f43b486 did not finish (assignment=execution_controller, module=execution_controller, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.028Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: [START] "elasticsearch_sender_api" operation shutdown (assignment=execution_controller, module=slicer_context, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.028Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: [START] "kafka_reader_api:kafka_reader-0" operation shutdown (assignment=execution_controller, module=slicer_context, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.028Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: [START] "kafka_reader" operation shutdown (assignment=execution_controller, module=slicer_context, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.028Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: [FINISH] "elasticsearch_sender_api" operation shutdown, took 0ms (assignment=execution_controller, module=slicer_context, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.028Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: [FINISH] "kafka_reader" operation shutdown, took 0ms (assignment=execution_controller, module=slicer_context, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.029Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: [FINISH] "kafka_reader_api:kafka_reader-0" operation shutdown, took 0ms (assignment=execution_controller, module=slicer_context, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.029Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: calculating statistics (assignment=execution_controller, module=slice_analytics, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.029Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: (assignment=execution_controller, module=slice_analytics, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)

    operation kafka_reader
    average completion time of: 30055 ms, min: 30055 ms, and max: 30055 ms
    average size: 0, min: 0, and max: 0
    average memory: -3266240, min: -3266240, and max: -3266240

[2024-09-17T16:18:28.029Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: (assignment=execution_controller, module=slice_analytics, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)

    operation elasticsearch_bulk
    average completion time of: 0 ms, min: 0 ms, and max: 0 ms
    average size: 0, min: 0, and max: 0
    average memory: 1688, min: 1688, and max: 1688

[2024-09-17T16:18:28.029Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: execution 4650a336-107b-45cc-b74f-00025f43b486 has finished in 60 seconds (assignment=execution_controller, module=execution_controller, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.051Z] ERROR: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: execution 4650a336-107b-45cc-b74f-00025f43b486 received shutdown before the slicer could complete, setting status to "terminated" (assignment=execution_controller, module=execution_controller, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.126Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: shutting down (assignment=execution_controller, module=state_storage, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.126Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: shutting down. (assignment=execution_controller, module=ex_storage, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.127Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: client 10.244.0.10__ZZM7zWA7 disconnected { reason: 'client namespace disconnect' } (assignment=execution_controller, module=messaging:server, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.129Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: client 4650a336-107b-45cc-b74f-00025f43b486 disconnected { reason: 'io client disconnect' } (assignment=execution_controller, module=messaging:client, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:28.171Z] DEBUG: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: execution 4650a336-107b-45cc-b74f-00025f43b486 is done (assignment=execution_controller, module=execution_controller, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:33.131Z]  WARN: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: execution controller 4650a336-107b-45cc-b74f-00025f43b486 is shutdown (assignment=execution_controller, module=execution_controller, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:33.132Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: execution_controller shutdown, already shutting down, remaining 25s (assignment=execution_controller, module=execution_controller:shutdown_handler, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:34.135Z] DEBUG: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: flushed logs successfully, will exit with code 1 (assignment=execution_controller, module=execution_controller:shutdown_handler, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:34.135Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-lrpwp: execution_controller shutdown took 6s, exit with 1 status code (assignment=execution_controller, module=execution_controller:shutdown_handler, worker_id=wLdBBtyp, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)

It will come up a second time but in a terminal status so it will exit and tell cluster master to cleanup resources regardless if it's restartable or not. Here are the logs for the ex coming back up:

[2024-09-17T16:18:52.655Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-54dlg: execution storage initialized (assignment=execution_controller, module=ex_storage, worker_id=_XN3y8UG, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:52.757Z] ERROR: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-54dlg: Unable to verify execution on initialization (assignment=execution_controller, module=execution_controller, worker_id=_XN3y8UG, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
    Error: Execution 4650a336-107b-45cc-b74f-00025f43b486 was starting in terminal status, sending execution:finished event to cluster master
        at ExecutionController._verifyExecution (file:///app/source/packages/teraslice/dist/src/lib/workers/execution-controller/index.js:776:21)
        at async ExecutionController.initialize (file:///app/source/packages/teraslice/dist/src/lib/workers/execution-controller/index.js:150:24)
        at async Service.initialize (file:///app/source/packages/teraslice/worker-service.js:40:9)
        at async main (file:///app/source/packages/teraslice/worker-service.js:80:9)
[2024-09-17T16:18:52.757Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-54dlg: shutting down. (assignment=execution_controller, module=ex_storage, worker_id=_XN3y8UG, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:52.757Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-54dlg: shutting down (assignment=execution_controller, module=state_storage, worker_id=_XN3y8UG, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:52.758Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-54dlg: client 4650a336-107b-45cc-b74f-00025f43b486 disconnected { reason: 'io client disconnect' } (assignment=execution_controller, module=messaging:client, worker_id=_XN3y8UG, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:52.789Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-54dlg: execution_controller received process:SIGTERM, already shutting down, remaining 30s (assignment=execution_controller, module=execution_controller:shutdown_handler, worker_id=_XN3y8UG, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:53.859Z] DEBUG: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-54dlg: flushed logs successfully, will exit with code 0 (assignment=execution_controller, module=execution_controller:shutdown_handler, worker_id=_XN3y8UG, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)
[2024-09-17T16:18:53.859Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-json-error-5f3a4bd7-1a9a-54dlg: execution_controller shutdown took 1s, exit with 0 status code (assignment=execution_controller, module=execution_controller:shutdown_handler, worker_id=_XN3y8UG, ex_id=4650a336-107b-45cc-b74f-00025f43b486, job_id=5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79)

Running kubectl -n ts-dev1 get all after second ex pod exits:

NAME                                    READY   STATUS    RESTARTS   AGE
pod/teraslice-master-54c7bfddfd-ntmv2   1/1     Running   0          12m

NAME                       TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service/teraslice-master   NodePort   10.96.166.169   <none>        5678:30678/TCP   12m

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/teraslice-master   1/1     1            1           12m

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/teraslice-master-54c7bfddfd   1         1         1       12m

Running curl localhost:5678/txt/ex after job exit:

name                    lifecycle   slicers  workers  _status     ex_id                                 job_id                                _created                  _updated                
----------------------  ----------  -------  -------  ----------  ------------------------------------  ------------------------------------  ------------------------  ------------------------
kafka-to-es-json-error  persistent  1        1        terminated  4650a336-107b-45cc-b74f-00025f43b486  5f3a4bd7-1a9a-4133-bef8-b5d9a129cd79  2024-09-17T16:17:18.591Z  2024-09-17T16:18:28.053Z