niaid / image_portal_workflows

Workflows related to project previously referred to as "Hedwig"
BSD 3-Clause "New" or "Revised" License

BRT pipeline possible refactor: Waiting for all mapped task runs to finish when partially finished tasks could go on with rest of the pipeline #417

Closed annshress closed 7 months ago

annshress commented 8 months ago

The image below shows that some of the mapped tasks have already completed while others are still running. However, the completed tasks could pass their results on to the next tasks rather than waiting for their sibling mapped tasks to complete.

image
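To make the idea concrete, here is a minimal sketch using only the Python standard library, not Prefect and not the actual BRT code. The names `convert`, `gen_metadata`, `process_item` and `run_pipeline` are hypothetical, and `time.sleep` stands in for real work. Each item is pushed through both stages as one unit, so a fast item reaches the second stage without waiting for slower siblings to finish the first.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def convert(delay):
    # Hypothetical first stage: the sleep stands in for per-file work.
    time.sleep(delay)
    return delay


def gen_metadata(delay, log):
    # Hypothetical second stage: record the order in which items arrive here.
    log.append(delay)
    return f"meta-{delay}"


def process_item(delay, log):
    # Chain both stages per item, so there is no barrier between the stages.
    return gen_metadata(convert(delay), log)


def run_pipeline(delays):
    log = []
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(process_item, d, log) for d in delays]
        # Results are collected in submission order, not completion order.
        results = [f.result() for f in futures]
    return results, log


if __name__ == "__main__":
    results, log = run_pipeline([0.3, 0.05, 0.15])
    print(results)
    print(log)
```

In this sketch `log` records the order in which items entered the second stage, so the shortest delay appears first even though it was submitted second. The same per-item chaining would need to be expressed with the workflow engine's own mapping primitives in the real pipeline.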

annshress commented 8 months ago

The BRT workflow currently fails when handling too many files (30 or more .mrc files of 1 GB each).

philipmac commented 8 months ago

https://prefect2.hedwig-workflow-api.niaiddev.net/task-runs/task-run/a21243dd-e0de-4f2c-9627-c27af7071dd4

then hangs

philipmac commented 8 months ago

slurm-log/0ae30525-93c1-4c6d-944b-eda31b6ed801/dask-worker-1122178.err

15:57:06.557 | INFO | Task run 'gen_ng_metadata-9' - Instantiating HWZarrImages /gs1/Scratch/hedwig_dev_scratch/tmp9z5qokp4/2013-1220-dA30_5-BSC-1_19_rec.zarr
15:57:06.560 | INFO | Task run 'gen_ng_metadata-9' - Accessing first HWZarrImage
15:57:06.563 | INFO | Task run 'gen_ng_metadata-9' - Creating ng metadata
15:57:06.563 | INFO | Task run 'gen_ng_metadata-9' - ... getting shader type
15:57:06.564 | INFO | Task run 'gen_ng_metadata-9' - ... getting dims
15:57:06.565 | INFO | Task run 'gen_ng_metadata-9' - ... getting shader params
15:57:06.565 | DEBUG | pytools.HedwigZarrImage - path: 0/0
15:57:06.567 | INFO | pytools.utils.histogram - ZARR array needs converting to native byteorder.
... job-extra': None, 'job-extra-directives': [], 'job-directives-skip': [], 'log-directory': None, 'scheduler-options': {}}}}
15:57:06.570 | INFO | pytools.HedwigZarrImage - Building histogram for "/gs1/Scratch/hedwig_dev_scratch/tmp9z5qokp4/2013-1220-dA30_5-BSC-1_19_rec.zarr/0"...

The job never returns.

Concurrency was set to 1.

annshress commented 8 months ago

This issue has been avoided by allocating enough cores to the workflow.

annshress commented 7 months ago

Resolved. Test case: https://prefect2.hedwig-workflow-api.niaiddev.net/flow-runs/flow-run/43c12d8e-30a6-4588-b97a-40db2477741a