terascope / teraslice

Scalable data processing pipelines in JavaScript
https://terascope.github.io/teraslice/
Apache License 2.0

Execution controller OOMs when large asset is used in job #3595

Closed (godber closed this issue 5 months ago)

godber commented 5 months ago

We have been doing further testing of the S3-backed asset store, and we recently tested with an internal asset that was 60MB zipped. Unzipped, the asset had the following composition:

.
├── [ 192]  __static_assets
│   ├── [ 22M]  data1.json.gz
│   ├── [7.1K]  data2.txt
│   ├── [1.2K]  data3.json
│   └── [ 37M]  data4.json.gz
├── [ 200]  asset.json
└── [7.5M]  index.js

To reproduce, it should be sufficient to create a mock asset with roughly the same characteristics and start a job with it. The execution controller should then OOM when run in k8s with the default memory limit of 512MB. When we increased the memory limit to 6GB, the execution controller did not OOM.
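For anyone reproducing this, here is a minimal sketch of one way to generate a mock asset with roughly this composition. It assumes random bytes are an acceptable stand-in for the static data files (random data does not compress, which mirrors the real asset since its bulk is already-gzipped files), and that you zip the directory afterwards with an external tool; all names below are placeholders.

// make-mock-asset.mjs -- sketch only; sizes approximate the asset described above.
import fs from 'node:fs';
import path from 'node:path';
import crypto from 'node:crypto';

// Write sizeBytes of random data in 1MB chunks so the generator itself stays small.
function writeRandomFile(filePath, sizeBytes) {
    const chunk = 1024 * 1024;
    const fd = fs.openSync(filePath, 'w');
    for (let written = 0; written < sizeBytes; written += chunk) {
        fs.writeSync(fd, crypto.randomBytes(Math.min(chunk, sizeBytes - written)));
    }
    fs.closeSync(fd);
}

const root = 'large-test-asset';
fs.mkdirSync(path.join(root, '__static_assets'), { recursive: true });

// The contents don't need to be valid JSON or gzip to exercise the asset loader;
// only the archive size and unzipped size matter for reproducing the OOM.
writeRandomFile(path.join(root, '__static_assets', 'data1.json.gz'), 22 * 1024 * 1024);
writeRandomFile(path.join(root, '__static_assets', 'data4.json.gz'), 37 * 1024 * 1024);

fs.writeFileSync(path.join(root, 'asset.json'), JSON.stringify({
    name: 'large-test-asset',
    version: '0.0.1'
}, null, 4));
fs.writeFileSync(path.join(root, 'index.js'), 'module.exports = {};\n'); // placeholder

// Then zip it, e.g.: zip -r large-test-asset.zip large-test-asset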

Here is some of the log output:

[2024-04-17T22:41:06.009Z] DEBUG: teraslice/18 on ts-exc-datagen-100m-noop-tmp1-3d18bacc-9e6d-7fsdc: getting record with id: 46f47558baf4e3f0e8f736ad5c91827a53cc4b4b from s3 minio_test1 connection, ts-assets-teraslice-tmp1 bucket. (assignment=execution_controller, module=assets_storage, worker_id=97W8Ruer, ex_id=77c23ff5-409e-4b45-a264-c089fd90b3e1, job_id=3d18bacc-9e6d-4651-96bf-5fffec667073)
[2024-04-17T22:41:06.533Z]  INFO: teraslice/18 on ts-exc-datagen-100m-noop-tmp1-3d18bacc-9e6d-7fsdc: loading assets: a5b3d9e48bce3b5f997ba7c21cb3d47945e231a2 (assignment=execution_controller, module=asset_loader, worker_id=97W8Ruer, ex_id=77c23ff5-409e-4b45-a264-c089fd90b3e1, job_id=3d18bacc-9e6d-4651-96bf-5fffec667073)
[2024-04-17T22:41:06.808Z]  INFO: teraslice/18 on ts-exc-datagen-100m-noop-tmp1-3d18bacc-9e6d-7fsdc: decompressing and saving asset a5b3d9e48bce3b5f997ba7c21cb3d47945e231a2 to /app/assets/a5b3d9e48bce3b5f997ba7c21cb3d47945e231a2 (assignment=execution_controller, module=asset_loader, worker_id=97W8Ruer, ex_id=77c23ff5-409e-4b45-a264-c089fd90b3e1, job_id=3d18bacc-9e6d-4651-96bf-5fffec667073)
[2024-04-17T22:41:10.938Z] ERROR: teraslice/7 on ts-exc-datagen-100m-noop-tmp1-3d18bacc-9e6d-7fsdc: Teraslice Worker shutting down due to failure! (assignment=execution_controller)
    Error: Failure to get assets, caused by exit code null
        at ChildProcess.<anonymous> (file:///app/source/packages/teraslice/dist/src/lib/workers/assets/spawn.js:45:31)
        at ChildProcess.emit (node:events:517:28)
        at maybeClose (node:internal/child_process:1098:16)
        at ChildProcess._handle.onexit (node:internal/child_process:303:5)
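For context on the "Failure to get assets, caused by exit code null" error: in Node.js a child process that is terminated by a signal (for example SIGKILL from the kernel when the container hits its memory limit) reports an exit code of null together with the signal name. A tiny standalone illustration, not Teraslice code:

// Demonstrates that a signal-killed child exits with code null (Node 18+).
import { spawn } from 'node:child_process';

const child = spawn(process.execPath, ['-e', 'setInterval(() => {}, 1000)']);
child.on('exit', (code, signal) => {
    console.log({ code, signal }); // -> { code: null, signal: 'SIGKILL' }
});
setTimeout(() => child.kill('SIGKILL'), 200);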

If necessary, I can supply the internal asset separately.

cc @busma13

godber commented 5 months ago

After further discussions with Peter and Joseph, there are a number of other things that limit asset size.

It's possible that our choice of zip archives for assets makes them not streamable ... so we might be a bit stuck there too.
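To illustrate the streamability point: gzip (and therefore tar.gz) is a forward-only format that can be decompressed incrementally as bytes arrive, while zip keeps its central directory at the end of the archive, so typical zip libraries want the whole file on disk or in memory before extraction begins. A minimal constant-memory gzip example using only Node built-ins (paths are placeholders):

import { createReadStream, createWriteStream } from 'node:fs';
import { createGunzip } from 'node:zlib';
import { pipeline } from 'node:stream/promises';

// Memory use stays flat no matter how large the .gz file is.
await pipeline(
    createReadStream('__static_assets/data1.json.gz'),
    createGunzip(),
    createWriteStream('/tmp/data1.json')
);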

Regardless, we are, at the very least, going to look at reducing overall memory usage during the asset load process to increase the asset size we can support.
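One general pattern for that, sketched with the AWS SDK v3 (this is not necessarily how the Teraslice asset store is implemented, and the endpoint, bucket, and key are placeholders taken from the logs above): stream the S3 object straight to disk instead of buffering the whole zip in memory.

import { createWriteStream } from 'node:fs';
import { pipeline } from 'node:stream/promises';
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';

// Credentials come from the environment; the endpoint assumes a local minio.
const s3 = new S3Client({
    endpoint: 'http://localhost:9000',
    region: 'us-east-1',
    forcePathStyle: true
});

// In Node, Body is a Readable stream, so the zip never has to fit in memory here.
const { Body } = await s3.send(new GetObjectCommand({
    Bucket: 'ts-assets-teraslice-tmp1',
    Key: 'a5b3d9e48bce3b5f997ba7c21cb3d47945e231a2'
}));
await pipeline(Body, createWriteStream('/tmp/asset.zip'));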

sotojn commented 5 months ago

Steps to recreate this issue locally:

  1. Stand up a local Teraslice in Kubernetes by running yarn k8s:minio --asset-storage='s3'.

  2. Upload the 60MB zipped asset using earl, or add the zipped asset to the autoload folder to skip this step.

  3. Create and register a job that uses the 60MB asset, then start the job (a rough HTTP sketch of steps 2 and 3 follows this list).

  4. Run kubectl get pods -n ts-dev1 to view all the running pods in the namespace.

  5. The pod whose name starts with ts-exc should be seen restarting with an OOM (OOMKilled) status.
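A rough sketch of steps 2 and 3 against the Teraslice master's HTTP API, for anyone who prefers not to use earl. The /v1/assets and /v1/jobs routes are the documented ones, but the host, port, job name, and operations list are assumptions for illustration, so adjust them for your cluster:

import { readFile } from 'node:fs/promises';

const teraslice = 'http://localhost:5678'; // assumed master address for the local k8s env

// Step 2: upload the ~60MB zipped asset.
const zip = await readFile('large-test-asset.zip');
const uploaded = await fetch(`${teraslice}/v1/assets`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/octet-stream' },
    body: zip
});
console.log(await uploaded.json());

// Step 3: register a job that references the asset by name (POST /v1/jobs starts it
// unless start=false is passed). The operations are placeholders; any reader/processor
// pair available on your cluster works (the original test fed a data generator into noop).
const job = {
    name: 'large-asset-oom-test',
    lifecycle: 'persistent',
    workers: 1,
    assets: ['large-test-asset'],
    operations: [
        { _op: 'data_generator', size: 100000000 },
        { _op: 'noop' }
    ]
};
const registered = await fetch(`${teraslice}/v1/jobs`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(job)
});
console.log(await registered.json());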

godber commented 5 months ago

The changes in https://github.com/terascope/teraslice/pull/3598 are sufficient to resolve this issue.