unity-sds / unity-sps-workflows

Catalog of CWL workflows
Apache License 2.0

[Risk]: Processing large data volumes with CWL and Docker #14

Open LucaCinquini opened 1 year ago

LucaCinquini commented 1 year ago

**Who:** U-SPS
**When:** April 2023
**What:** Copying data from PCM to the Docker container might cause issues: large data volumes require a lot of storage and time for data transfer, which might in turn cause issues with CWL.
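As a back-of-envelope illustration of the risk (all numbers and names below are hypothetical, not measurements from this system): if every CWL step staged its own copy of the inputs, the extra storage and transfer time would scale with the number of steps, whereas a single bind-mounted copy would not.

```python
# Hypothetical sizing sketch: extra node storage and transfer time if each
# of `copies` step containers received its own staged copy of the inputs.
def staging_cost(input_gib: float, copies: int, throughput_mib_s: float):
    """Return (extra storage in GiB, transfer time in seconds) for
    copying `input_gib` GiB of input `copies` times."""
    extra_gib = input_gib * copies
    seconds = extra_gib * 1024 / throughput_mib_s
    return extra_gib, seconds

# e.g. 50 GiB of input staged into 3 step containers at 200 MiB/s
extra, secs = staging_cost(50, 3, 200)
print(f"extra storage: {extra} GiB, transfer time: {secs:.0f} s")
```

With bind mounts, the same workflow needs only the single copy of the data, plus room for outputs.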

LucaCinquini commented 1 year ago

TL;DR: when executing multiple CWL steps via Docker containers, it seems that data is not copied from one container to the next, but rather referenced by bind-mounting volumes into the successive containers. So when executing the CHIRP workflow we should make sure that:

a) The EKS node has enough storage to hold all input and output data (a single copy only).
b) The CWL steps do NOT use the "staging" option, which would cause the input data to be copied into the current working directory. In other words, do NOT do something like this:

```yaml
requirements:
  InitialWorkDirRequirement:
    listing:
```
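For contrast, a minimal sketch of the pattern to prefer (tool and input names here are illustrative, not taken from the CHIRP workflow): declare the data as a `Directory` input and reference its path directly, so the runner bind-mounts the existing copy read-only instead of staging it into the working directory.

```yaml
# Illustrative CommandLineTool fragment; names are hypothetical.
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [process]
inputs:
  input_dir:
    type: Directory
arguments:
  # Reference the mounted path directly; no InitialWorkDirRequirement,
  # so the runner does not copy the data into the working directory.
  - $(inputs.input_dir.path)
outputs: []
```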

LucaCinquini commented 1 year ago

The evidence: I ran the L1A workflow, which downloads data from DAPA and uses ancillary data stored on EFS. The detailed steps of each Docker execution show the volumes being mounted onto successive containers.

```
cwl-runner ssips_L1a_workflow.cwl ssips_L1a_workflow_mcp_test.yml
....
INFO [job l1a-stage-in-2] /tmp/hrenmt53$ docker \
    run \
    -i \
    --mount=type=bind,source=/tmp/hrenmt53,target=/NIGNgi \
    --mount=type=bind,source=/tmp/60tmw3l0,target=/tmp \
    --workdir=/NIGNgi \
    --read-only=true \
    --log-driver=none \
    --user=1000:1000 \
    --rm \
    --cidfile=/tmp/a0fzvbt4/20230420171306-696452.cid \
    --env=TMPDIR=/tmp \
    --env=HOME=/NIGNgi \
    --env=AWS_REGION=us-west-2 \
    '--env=CLIENT_ID=(secret-537db025-591d-4397-a20a-405a57b025da)' \
    --env=COGNITO_URL=https://cognito-idp.us-west-2.amazonaws.com \
    --env=COLLECTION_ID=L0_SNPP_ATMS_SCIENCE_1 \
    --env=DAPA_API=https://58nbcawrvb.execute-api.us-west-2.amazonaws.com/test \
    --env=DATE_FROM=2016-01-14T08:00:00Z \
    --env=DATE_TO=2016-01-14T11:59:59Z \
    --env=DOWNLOAD_DIR=/NIGNgi/atms_science \
    --env=LIMITS=100 \
    --env=LOG_LEVEL=20 \
    '--env=PASSWORD=(secret-4af20f79-7640-4c15-b158-39846b7c8680)' \
    --env=PASSWORD_TYPE=PARAM_STORE \
    '--env=USERNAME=(secret-f62c6163-fa2a-4d0b-8b4e-6cc82dc0f0c1)' \
    --env=VERIFY_SSL=FALSE \
    ghcr.io/unity-sds/unity-data-services:1.10.1 \
    download > /tmp/hrenmt53/stdout_dapa_download.txt 2> /tmp/hrenmt53/stderr_dapa_download.txt
INFO [job l1a-stage-in-2] Max memory used: 0MiB
INFO [job l1a-stage-in-2] completed success
INFO [step l1a-stage-in-2] completed success
INFO [workflow ] starting step l1a-stage-in-1
INFO [step l1a-stage-in-1] start
INFO [job l1a-stage-in-1] /tmp/6pktu7pp$ docker \
    run \
    -i \
    --mount=type=bind,source=/tmp/6pktu7pp,target=/NIGNgi \
    --mount=type=bind,source=/tmp/q1mpumql,target=/tmp \
    --workdir=/NIGNgi \
    --read-only=true \
    --log-driver=none \
    --user=1000:1000 \
    --rm \
    --cidfile=/tmp/4ssdpgrz/20230420171336-064311.cid \
    --env=TMPDIR=/tmp \
    --env=HOME=/NIGNgi \
    --env=AWS_REGION=us-west-2 \
    '--env=CLIENT_ID=(secret-537db025-591d-4397-a20a-405a57b025da)' \
    --env=COGNITO_URL=https://cognito-idp.us-west-2.amazonaws.com \
    --env=COLLECTION_ID=L0_SNPP_EphAtt___1 \
    --env=DAPA_API=https://58nbcawrvb.execute-api.us-west-2.amazonaws.com/test \
    --env=DATE_FROM=2016-01-14T08:00:00Z \
    --env=DATE_TO=2016-01-14T11:59:59Z \
    --env=DOWNLOAD_DIR=/NIGNgi/ephatt \
    --env=LIMITS=100 \
    --env=LOG_LEVEL=20 \
    '--env=PASSWORD=(secret-4af20f79-7640-4c15-b158-39846b7c8680)' \
    --env=PASSWORD_TYPE=PARAM_STORE \
    '--env=USERNAME=(secret-f62c6163-fa2a-4d0b-8b4e-6cc82dc0f0c1)' \
    --env=VERIFY_SSL=FALSE \
    ghcr.io/unity-sds/unity-data-services:1.10.1 \
    download > /tmp/6pktu7pp/stdout_dapa_download.txt 2> /tmp/6pktu7pp/stderr_dapa_download.txt
INFO [job l1a-stage-in-1] Max memory used: 67MiB
INFO [job l1a-stage-in-1] completed success
INFO [step l1a-stage-in-1] completed success
INFO [workflow ] starting step l1a-run-pge
INFO [step l1a-run-pge] start
INFO [workflow l1a-run-pge] start
INFO [workflow l1a-run-pge] starting step l1a_process
INFO [step l1a_process] start
INFO ['docker', 'pull', 'public.ecr.aws/unity-ads/sounder_sips_l1a_pge:r0.2.0']
r0.2.0: Pulling from unity-ads/sounder_sips_l1a_pge
d7bfe07ed847: Pull complete
2e8eaf67b67e: Pull complete
732644f00cd7: Pull complete
4f4fb700ef54: Pull complete
d7413cb7e953: Pull complete
f5006e242035: Pull complete
4f57eff15618: Pull complete
035e8fad77be: Pull complete
d36fd955f407: Pull complete
d6d9af327181: Pull complete
2e34d8491065: Pull complete
28f635eb91af: Pull complete
9bd91e81ff3d: Pull complete
bccf2a8cadca: Pull complete
af54cd59bb64: Pull complete
f4618619ba24: Pull complete
199c46d5f1ec: Pull complete
bfaf7925739b: Pull complete
8a74aa4320c7: Pull complete
fefe7a6488d5: Pull complete
Digest: sha256:2079775e5581d693908f0b56b475898f9bfe7ce35f9177ab090ab7d733eef32a
Status: Downloaded newer image for public.ecr.aws/unity-ads/sounder_sips_l1a_pge:r0.2.0
INFO [job l1a_process] /tmp/55klcu9h$ docker \
    run \
    -i \
    --mount=type=bind,source=/tmp/55klcu9h,target=/NIGNgi \
    --mount=type=bind,source=/tmp/mmkno7ym,target=/tmp \
    --mount=type=bind,source=/tmp/6pktu7pp/ephatt,target=/var/lib/cwl/stgad090a1d-ff63-4d1a-bb4b-d11c7b6c8f94/ephatt,readonly \
    --mount=type=bind,source=/tmp/hrenmt53/atms_science,target=/var/lib/cwl/stgf9a7818a-9b37-45b7-8e11-21be8d3f2081/atms_science,readonly \
    --mount=type=bind,source=/tmp/SOUNDER_SIPS/STATIC_DATA,target=/var/lib/cwl/stg683ed8e2-c96b-4ab7-bd27-24a6c384722d/STATIC_DATA,readonly \
    --workdir=/NIGNgi \
    --read-only=true \
    --log-driver=none \
    --user=1000:1000 \
    --rm \
    --cidfile=/tmp/3iinmgmz/20230420171453-362624.cid \
    --env=TMPDIR=/tmp \
    --env=HOME=/NIGNgi \
    public.ecr.aws/unity-ads/sounder_sips_l1a_pge:r0.2.0 \
    /NIGNgi/processed_notebook.ipynb \
    -p input_ephatt_path /var/lib/cwl/stgad090a1d-ff63-4d1a-bb4b-d11c7b6c8f94/ephatt \
    -p input_science_path /var/lib/cwl/stgf9a7818a-9b37-45b7-8e11-21be8d3f2081/atms_science \
    -p output_path /NIGNgi \
    -p data_static_path /var/lib/cwl/stg683ed8e2-c96b-4ab7-bd27-24a6c384722d/STATIC_DATA \
    -p start_datetime 2016-01-14T08:00:00Z \
    -p end_datetime 2016-01-14T11:59:59Z > /tmp/55klcu9h/l1a_pge_stdout.txt 2> /tmp/55klcu9h/l1a_pge_stderr.txt
INFO [job l1a_process] Max memory used: 0MiB
INFO [job l1a_process] completed success
INFO [step l1a_process] completed success
INFO [workflow l1a-run-pge] completed success
INFO [step l1a-run-pge] completed success
INFO [workflow ] starting step l1a-stage-out
INFO [step l1a-stage-out] start
INFO [job l1a-stage-out] /tmp/fsk1wpdq$ docker \
    run \
    -i \
    --mount=type=bind,source=/tmp/fsk1wpdq,target=/NIGNgi \
    --mount=type=bind,source=/tmp/omxw_wcr,target=/tmp \
    --mount=type=bind,source=/tmp/55klcu9h,target=/NIGNgi/55klcu9h,readonly \
    --workdir=/NIGNgi \
    --read-only=true \
    --log-driver=none \
    --user=1000:1000 \
    --rm \
    --cidfile=/tmp/cibjn0lo/20230420171810-262195.cid \
    --env=TMPDIR=/tmp \
    --env=HOME=/NIGNgi \
    --env=AWS_REGION=us-west-2 \
    '--env=CLIENT_ID=(secret-537db025-591d-4397-a20a-405a57b025da)' \
    --env=COGNITO_URL=https://cognito-idp.us-west-2.amazonaws.com \
    --env=COLLECTION_ID=SNDR_SNPP_ATMS_L1AOUTPUT1 \
    --env=DAPA_API=https://58nbcawrvb.execute-api.us-west-2.amazonaws.com/test \
    --env=DELETE_FILES=FALSE \
    --env=LOG_LEVEL=20 \
    '--env=PASSWORD=(secret-4af20f79-7640-4c15-b158-39846b7c8680)' \
    --env=PASSWORD_TYPE=PARAM_STORE \
    --env=PROVIDER_ID=SNPP \
    --env=STAGING_BUCKET=uds-test-cumulus-staging \
    --env=UPLOAD_DIR=/NIGNgi/55klcu9h \
    '--env=USERNAME=(secret-f62c6163-fa2a-4d0b-8b4e-6cc82dc0f0c1)' \
    --env=VERIFY_SSL=FALSE \
    ghcr.io/unity-sds/unity-data-services:1.10.1 \
    upload > /tmp/fsk1wpdq/stdout_dapa_upload.txt 2> /tmp/fsk1wpdq/stderr_dapa_upload.txt
INFO [job l1a-stage-out] Max memory used: 68MiB
INFO [job l1a-stage-out] completed success
```
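To check logs like the above systematically, one can extract the `--mount=type=bind,...` flags from each `docker run` invocation and confirm that step outputs reappear downstream as read-only bind mounts rather than copies. A minimal sketch (the regex and sample strings are illustrative, not part of any Unity tooling):

```python
import re

# Match docker bind-mount flags of the form
#   --mount=type=bind,source=<path>,target=<path>[,readonly]
MOUNT_RE = re.compile(
    r"--mount=type=bind,source=([^,]+),target=([^,\s\\]+)(,readonly)?"
)

def bind_mounts(log_text: str):
    """Return (source, target, readonly) tuples for every bind mount
    found in a cwltool/docker log."""
    return [(m.group(1), m.group(2), bool(m.group(3)))
            for m in MOUNT_RE.finditer(log_text)]

# Hypothetical log excerpt in the same shape as the output above.
sample = (
    "--mount=type=bind,source=/tmp/6pktu7pp,target=/NIGNgi \\ "
    "--mount=type=bind,source=/tmp/6pktu7pp/ephatt,"
    "target=/var/lib/cwl/stg1234/ephatt,readonly \\"
)
for src, tgt, ro in bind_mounts(sample):
    print(src, "->", tgt, "(ro)" if ro else "(rw)")
```

In the log above, the same host directories (e.g. `/tmp/6pktu7pp/ephatt`, `/tmp/hrenmt53/atms_science`, `/tmp/55klcu9h`) show up as read-only mounts in later steps, which is the evidence that data is referenced rather than re-copied.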