terascope / teraslice

Scalable data processing pipelines in JavaScript
https://terascope.github.io/teraslice/
Apache License 2.0

Failure initializing assets_service does not cause a fatal fault #3596

Closed briend closed 3 weeks ago

briend commented 3 months ago

If you use S3 for asset storage (asset_storage_connection_type: s3) and misconfigure the S3 connector with an incorrect certLocation or caCertificate, the teraslice master container does not exit with an error, even though it cannot function. Normally I'd expect a crash-looping Pod. In the logs below, both assets_service and cluster_master shut down with exit code 0 and the node_master process keeps running.

example master logs for invalid certLocation:

[2024-04-17T00:22:56.578Z]  INFO: teraslice/7 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: Service starting (assignment=node_master)
[2024-04-17T00:22:56.580Z] DEBUG: teraslice/7 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: messaging service configuration for assignment node_master (assignment=node_master, module=node_master)
[2024-04-17T00:22:56.580Z]  INFO: teraslice/7 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: node teraslice-tmp1-master-5cf46cd4bf-5wrf6 is attempting to connect to cluster_master: http://teraslice-tmp1:5678 (assignment=node_master, module=node_master)
[2024-04-17T00:22:56.593Z] DEBUG: teraslice/7 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: client network connection is online (assignment=node_master, module=node_master)
[2024-04-17T00:22:56.593Z] DEBUG: teraslice/7 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: node teraslice-tmp1-master-5cf46cd4bf-5wrf6 is creating the cluster_master (assignment=node_master, module=node_master)
[2024-04-17T00:22:56.594Z]  INFO: teraslice/7 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: Starting 1 cluster_master (assignment=node_master)
[2024-04-17T00:22:56.599Z] DEBUG: teraslice/7 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: node teraslice-tmp1-master-5cf46cd4bf-5wrf6 is creating assets endpoint on port 45679 (assignment=node_master, module=node_master)
[2024-04-17T00:22:56.599Z]  INFO: teraslice/7 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: Starting 1 assets_service (assignment=node_master)
[2024-04-17T00:22:56.648Z] DEBUG: teraslice/7 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: worker process has come online (assignment=node_master, module=node_master)
[2024-04-17T00:22:56.648Z] DEBUG: teraslice/7 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: worker process has come online (assignment=node_master, module=node_master)
(node:19) ExperimentalWarning: Importing JSON modules is an experimental feature and might change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
(node:18) ExperimentalWarning: Importing JSON modules is an experimental feature and might change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
[2024-04-17T00:22:59.530Z] ERROR: teraslice/19 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: Cluster Worker shutting down due to failure! (assignment=assets_service, err.code=INTERNAL_SERVER_ERROR)
    TSError: Failure while creating assets_service, caused by Error: No cert path was found in config.certLocation: "/app/config/certs/root.pem"
        at AssetsService.initialize (file:///app/source/packages/teraslice/dist/src/lib/cluster/services/assets.js:120:19)
        at async Service.initialize (file:///app/source/packages/teraslice/cluster-service.js:26:9)
        at async main (file:///app/source/packages/teraslice/cluster-service.js:53:9)
[2024-04-17T00:22:59.531Z]  INFO: teraslice/19 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: creating connection for elasticsearch-next (assignment=assets_service)
[2024-04-17T00:22:59.553Z] DEBUG: teraslice/19 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: Creating an opensearch client v1 (assignment=assets_service)
[2024-04-17T00:22:59.631Z]  INFO: teraslice/19 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: shutting asset store down. (assignment=assets_service, module=assets_storage, worker_id=oiZuWMIj)
[2024-04-17T00:22:59.632Z] ERROR: teraslice/19 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: shutdown error after timeout (assignment=assets_service, module=assets_service:shutdown_handler, worker_id=oiZuWMIj)
    TypeError: Cannot read properties of undefined (reading 'destroy')
        at S3Store.shutdown (file:///app/source/packages/teraslice/dist/src/lib/storage/backends/s3_store.js:214:18)
        at AssetsStorage.shutdown (file:///app/source/packages/teraslice/dist/src/lib/storage/assets.js:271:32)
        at AssetsService.shutdown (file:///app/source/packages/teraslice/dist/src/lib/cluster/services/assets.js:214:34)
        at file:///app/source/packages/teraslice/cluster-service.js:49:29
        at callShutdownFn (file:///app/source/packages/teraslice/dist/src/lib/workers/helpers/worker-shutdown.js:69:15)
        at async /app/source/packages/utils/dist/src/promises.js:240:32
[2024-04-17T00:22:59.637Z]  INFO: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: cluster master listening on port 5678 (assignment=cluster_master, module=cluster_master, worker_id=dsmhwABB)
[2024-04-17T00:22:59.640Z]  INFO: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: creating connection for elasticsearch-next (assignment=cluster_master)
[2024-04-17T00:22:59.651Z]  INFO: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: creating connection for elasticsearch-next (assignment=cluster_master)
[2024-04-17T00:22:59.653Z]  INFO: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: creating connection for elasticsearch-next (assignment=cluster_master)
[2024-04-17T00:22:59.661Z] DEBUG: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: Creating an opensearch client v1 (assignment=cluster_master)
[2024-04-17T00:22:59.665Z] DEBUG: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: Creating an opensearch client v1 (assignment=cluster_master)
[2024-04-17T00:22:59.667Z] DEBUG: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: Creating an opensearch client v1 (assignment=cluster_master)
[2024-04-17T00:22:59.678Z]  INFO: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: execution storage initialized (assignment=cluster_master, module=ex_storage, worker_id=dsmhwABB)
[2024-04-17T00:22:59.678Z]  INFO: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: state storage initialized (assignment=cluster_master, module=state_storage, worker_id=dsmhwABB)
[2024-04-17T00:22:59.679Z]  INFO: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: job storage initialized (assignment=cluster_master, module=job_storage, worker_id=dsmhwABB)
[2024-04-17T00:22:59.680Z]  INFO: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: kubernetes clustering initializing (assignment=cluster_master, module=kubernetes_cluster_service, worker_id=dsmhwABB)
[2024-04-17T00:22:59.864Z]  INFO: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: execution service is initializing... (assignment=cluster_master, module=execution_service, worker_id=dsmhwABB)
[2024-04-17T00:22:59.885Z] DEBUG: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: execution queue initialization complete (assignment=cluster_master, module=execution_service, worker_id=dsmhwABB)
[2024-04-17T00:22:59.886Z]  INFO: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: job service is initializing... (assignment=cluster_master, module=jobs_service, worker_id=dsmhwABB)
[2024-04-17T00:22:59.886Z] DEBUG: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: services has been initialized (assignment=cluster_master, module=cluster_master, worker_id=dsmhwABB)
[2024-04-17T00:22:59.896Z] DEBUG: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: asset service not up yet, error: connect ECONNREFUSED 127.0.0.1:45679 (assignment=cluster_master, module=cluster_master, worker_id=dsmhwABB)
[2024-04-17T00:23:00.633Z] DEBUG: teraslice/19 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: flushed logs successfully, will exit with code 0 (assignment=assets_service, module=assets_service:shutdown_handler, worker_id=oiZuWMIj)
[2024-04-17T00:23:00.633Z]  INFO: teraslice/19 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: assets_service shutdown took 1s, exit with 0 status code (assignment=assets_service, module=assets_service:shutdown_handler, worker_id=oiZuWMIj)
[2024-04-17T00:23:00.653Z]  INFO: teraslice/7 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: assets_service has exited, id: 2, code: 0, signal: null (assignment=node_master)
[2024-04-17T00:23:00.900Z] DEBUG: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: asset service not up yet, error: connect ECONNREFUSED 127.0.0.1:45679 (assignment=cluster_master, module=cluster_master, worker_id=dsmhwABB)
...<repeated message>
[2024-04-17T00:27:59.950Z] ERROR: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: error during service initialization (assignment=cluster_master, module=cluster_master, worker_id=dsmhwABB)
    Error: Timeout waiting for asset service to come online
        at ClusterMaster.waitForAssetsService (file:///app/source/packages/teraslice/dist/src/lib/cluster/cluster_master.js:38:35)
        at ClusterMaster.waitForAssetsService (file:///app/source/packages/teraslice/dist/src/lib/cluster/cluster_master.js:45:21)
        at async ClusterMaster.initialize (file:///app/source/packages/teraslice/dist/src/lib/cluster/cluster_master.js:112:13)
        at async Service.initialize (file:///app/source/packages/teraslice/cluster-service.js:26:9)
        at async main (file:///app/source/packages/teraslice/cluster-service.js:53:9)
[2024-04-17T00:27:59.950Z] ERROR: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: Cluster Worker shutting down due to failure! (assignment=cluster_master)
    Error: Timeout waiting for asset service to come online
        at ClusterMaster.waitForAssetsService (file:///app/source/packages/teraslice/dist/src/lib/cluster/cluster_master.js:38:35)
        at ClusterMaster.waitForAssetsService (file:///app/source/packages/teraslice/dist/src/lib/cluster/cluster_master.js:45:21)
        at async ClusterMaster.initialize (file:///app/source/packages/teraslice/dist/src/lib/cluster/cluster_master.js:112:13)
        at async Service.initialize (file:///app/source/packages/teraslice/cluster-service.js:26:9)
        at async main (file:///app/source/packages/teraslice/cluster-service.js:53:9)
[2024-04-17T00:28:00.051Z]  INFO: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: cluster_master is shutting down (assignment=cluster_master, module=cluster_master, worker_id=dsmhwABB)
[2024-04-17T00:28:00.052Z]  INFO: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: shutting down (assignment=cluster_master, module=execution_service, worker_id=dsmhwABB)
[2024-04-17T00:28:00.053Z]  INFO: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: shutting down api service (assignment=cluster_master, module=api_service, worker_id=dsmhwABB)
[2024-04-17T00:28:00.057Z]  INFO: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: shutting down. (assignment=cluster_master, module=ex_storage, worker_id=dsmhwABB)
[2024-04-17T00:28:00.058Z]  INFO: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: shutting down (assignment=cluster_master, module=state_storage, worker_id=dsmhwABB)
[2024-04-17T00:28:00.058Z]  INFO: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: shutting down. (assignment=cluster_master, module=job_storage, worker_id=dsmhwABB)
[2024-04-17T00:28:06.066Z] DEBUG: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: flushed logs successfully, will exit with code 0 (assignment=cluster_master, module=cluster_master:shutdown_handler, worker_id=dsmhwABB)
[2024-04-17T00:28:06.066Z]  INFO: teraslice/18 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: cluster_master shutdown took 6s, exit with 0 status code (assignment=cluster_master, module=cluster_master:shutdown_handler, worker_id=dsmhwABB)
[2024-04-17T00:28:06.094Z]  INFO: teraslice/7 on teraslice-tmp1-master-5cf46cd4bf-5wrf6: cluster_master has exited, id: 1, code: 0, signal: null (assignment=node_master)
godber commented 1 month ago

@sotojn try to reproduce this, then tell us what the job status is and show us all the k8s resource statuses ... something like kubectl -n ts-test get all,svc.

godber commented 1 month ago

If the execution controller process doesn't shut down correctly in this case, that is likely an ADDITIONAL problem.
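
For readers following the logs in the original report, two things go wrong in sequence: the shutdown path calls destroy on a client that was never created, and the failed service then exits with code 0. The sketch below illustrates one way to guard against both. It is not the Teraslice implementation: `SketchStore`, `exitCodeForStartup`, and the `createClient` callback are hypothetical names made up for this example.

```javascript
// Hypothetical sketch. None of these names come from the Teraslice source.
class SketchStore {
    constructor(createClient) {
        // createClient is an assumed async factory for the storage client.
        this.createClient = createClient;
        this.client = undefined;
    }

    async initialize() {
        // May throw, for example when a certificate path does not exist.
        this.client = await this.createClient();
    }

    async shutdown() {
        // Guard: initialize() may have failed before the client was assigned,
        // so only destroy the client if it actually exists.
        if (this.client != null) {
            this.client.destroy();
            this.client = undefined;
        }
    }
}

// Returns the exit code a process could use after attempting startup:
// 0 when initialization succeeded, 1 when it failed.
async function exitCodeForStartup(service) {
    try {
        await service.initialize();
        return 0;
    } catch (initErr) {
        try {
            await service.shutdown();
        } catch (shutdownErr) {
            // A shutdown failure should not mask the original startup error.
        }
        return 1;
    }
}
```

In a real service the returned value would presumably be assigned to `process.exitCode` (or passed to `process.exit`), so that a supervising parent process could treat a non-zero exit from a required child as fatal and exit as well. Whether that fits the actual service lifecycle here is an assumption.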