niaid / image_portal_workflows

Workflows related to project previously referred to as "Hedwig"
BSD 3-Clause "New" or "Revised" License

SEM-Tomo pipeline has been running for more than 7 hours in QA - 7/5/2023 #257

Closed by NetaFG 1 year ago

NetaFG commented 1 year ago

Pipeline run: https://prefect1.hedwig-workflow-api.niaidqa.net/default/flow-run/6fb29e11-3ff0-44e9-a592-9079d824f863?schematic

philipmac commented 1 year ago

From the logs: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='prefect1.hedwig-workflow-api.niaidqa.net', port=4200): Max retries exceeded with url: / (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f673d945b40>: Failed to resolve 'prefect1.hedwig-workflow-api.niaidqa.net' ([Errno -2] Name or service not known)"))

Looks like a DNS failure somewhere.
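A quick way to confirm whether name resolution is still failing from a given node is a small retrying lookup. This is only a sketch using the standard library; `can_resolve` and its parameters are hypothetical names, not part of the workflows repo.

```python
import socket
import time


def can_resolve(host, port=443, attempts=3, delay=1.0, resolver=socket.getaddrinfo):
    """Return True if `host` resolves within `attempts` tries.

    `resolver` is injectable so the retry logic can be exercised
    without touching the network (hypothetical helper, not repo code).
    """
    for attempt in range(1, attempts + 1):
        try:
            resolver(host, port)
            return True
        except socket.gaierror:
            # Same error family as "[Errno -2] Name or service not known".
            if attempt < attempts:
                time.sleep(delay)
    return False
```

Running `can_resolve("prefect1.hedwig-workflow-api.niaidqa.net", port=4200)` on the worker node would show whether the lookup fails there specifically or everywhere.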

philipmac commented 1 year ago

Testing: the DNS issue does not appear to be ongoing; see the callback at https://prefect1.hedwig-workflow-api.niaidqa.net/default/flow-run/6904c8d0-cfd5-494f-b3a1-fa6ef1bbe453?logs

Will re-run the above job.

philipmac commented 1 year ago

https://prefect1.hedwig-workflow-api.niaidqa.net/default/flow-run/fda7af72-93c5-4ba1-8b9e-01f4cab432e3

philipmac commented 1 year ago

The generate zarr step is failing on the above job. Changing the walltime arg to 10 hours to see whether the job is being killed after 4 hours.

philipmac commented 1 year ago

$ head slurm-995465.out

2023-07-06 13:36:31,949 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.140.222.184:35942'
...
[2023-07-06 23:28:32-0600] INFO - prefect.gen_zarr[0] | copying /gs1/Scratch/hedwig_qa_scratch/tmpcl1_wwhd/set 1.zarr to /mnt/ai-fas12/RMLEMHedwigQA/Assets/RTB/efischer/hansenbry-2022-0112-NIAID-QA/SEM-Tomo-2022-0112-JanTest/Test1/set 1/set 1.zarr
slurmstepd: error: *** JOB 995465 ON ai-rmlcpu24 CANCELLED AT 2023-07-06T23:36:30 DUE TO TIME LIMIT ***

It appears that for inputs containing 1537 TIFFs, 10 hours is still insufficient, e.g.:

$ ls /mnt/ai-fas12/RMLEMHedwigQA/Projects/RTB/efischer/hansenbry-2022-0112-NIAID-QA/SEM-Tomo-2022-0112-JanTest/Test1/set\ 1/DENVZ_1.*

Bumping to 24 hours.
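As a sanity check on the slurm output above, the gap between the nanny start and the cancellation can be computed directly. This is a minimal sketch that assumes both timestamps are in the same local time zone (neither line carries an offset); the variable names are just for illustration.

```python
from datetime import datetime, timedelta

# Timestamps copied from the slurm-995465.out excerpt above.
nanny_start = datetime.strptime("2023-07-06 13:36:31,949", "%Y-%m-%d %H:%M:%S,%f")
cancelled_at = datetime.strptime("2023-07-06T23:36:30", "%Y-%m-%dT%H:%M:%S")

elapsed = cancelled_at - nanny_start
walltime = timedelta(hours=10)  # the limit that was set for this run

print("elapsed:", elapsed)                        # elapsed: 9:59:58.051000
print("short of walltime by:", walltime - elapsed)  # short of walltime by: 0:00:01.949000
```

The job ran for essentially the full 10 hours, which is consistent with the "DUE TO TIME LIMIT" cancellation rather than an earlier failure.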

philipmac commented 1 year ago

https://prefect1.hedwig-workflow-api.niaidqa.net/default/flow-run/54f86d6a-d620-40dd-b90f-6e039f02c844

philipmac commented 1 year ago

Seems fixed: https://prefect1.hedwig-workflow-api.niaidqa.net/default/flow-run/54f86d6a-d620-40dd-b90f-6e039f02c844 Although a shorter runtime was noted.

See log: https://gist.github.com/philipmac/8ecf6b4bf4acb64de0dbf9c117250485