NetaFG closed 1 year ago
From the logs: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='prefect1.hedwig-workflow-api.niaidqa.net', port=4200): Max retries exceeded with url: / (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f673d945b40>: Failed to resolve 'prefect1.hedwig-workflow-api.niaidqa.net' ([Errno -2] Name or service not known)"))
Looks like a DNS failure somewhere.
Testing: the DNS issue does not appear to be ongoing, see callback https://prefect1.hedwig-workflow-api.niaidqa.net/default/flow-run/6904c8d0-cfd5-494f-b3a1-fa6ef1bbe453?logs
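For future occurrences, a quick way to tell a transient DNS blip from an ongoing outage is to retry name resolution a few times before re-running the job. The following is only a minimal sketch using the Python standard library; the function name `can_resolve`, its parameters, and the injectable `resolve` argument are hypothetical and not part of the pipeline.

```python
import socket
import time


def can_resolve(host, port=443, attempts=3, delay_s=1.0, resolve=socket.getaddrinfo):
    """Return True if `host` resolves within `attempts` tries.

    `resolve` is injectable so the check can be exercised without network
    access (an assumption made for this sketch, not pipeline behaviour).
    """
    for attempt in range(attempts):
        try:
            resolve(host, port)
            return True
        except socket.gaierror:
            # "[Errno -2] Name or service not known" surfaces as socket.gaierror.
            if attempt < attempts - 1:
                time.sleep(delay_s)
    return False
```

Calling `can_resolve("prefect1.hedwig-workflow-api.niaidqa.net", port=4200)` from the node that ran the job would show whether the hostname from the traceback resolves there now.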
Will re-run the above job.
generate zarr is failing on the above job. Changing the walltime arg to 10 hours to see whether the job was getting killed after 4 hours.
$ head slurm-995465.out
2023-07-06 13:36:31,949 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.140.222.184:35942'
...
[2023-07-06 23:28:32-0600] INFO - prefect.gen_zarr[0] | copying /gs1/Scratch/hedwig_qa_scratch/tmpcl1_wwhd/set 1.zarr to /mnt/ai-fas12/RMLEMHedwigQA/Assets/RTB/efischer/hansenbry-2022-0112-NIAID-QA/SEM-Tomo-2022-0112-JanTest/Test1/set 1/set 1.zarr
slurmstepd: error: *** JOB 995465 ON ai-rmlcpu24 CANCELLED AT 2023-07-06T23:36:30 DUE TO TIME LIMIT ***
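The two timestamps in the output above can be compared to confirm that the job ran up to the 10 hour limit. This is a small sketch, assuming the Nanny start line is close to the real job start and that both timestamps are in the same timezone; the helper name `elapsed_hours` is hypothetical.

```python
from datetime import datetime


def elapsed_hours(start, end):
    """Hours between two 'YYYY-MM-DD HH:MM:SS' timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600


# Nanny start (milliseconds dropped) and the slurmstepd cancellation time
# (the "T" separator replaced by a space), both copied from the log above.
hours = elapsed_hours("2023-07-06 13:36:31", "2023-07-06 23:36:30")
print(f"{hours:.2f} hours")
```

The result is one second short of 10 hours, which is consistent with the "DUE TO TIME LIMIT" cancellation.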
It appears that for inputs containing 1537 TIFFs, 10 hours is still insufficient, e.g.:
ls /mnt/ai-fas12/RMLEMHedwigQA/Projects/RTB/efischer/hansenbry-2022-0112-NIAID-QA/SEM-Tomo-2022-0112-JanTest/Test1/set\ 1/DENVZ_1.*
Bumping to 24 hours.
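SLURM accepts time limits in `hours:minutes:seconds` form, so the limits tried in this thread would be written as below. This is only an illustrative sketch: the helper `slurm_walltime` is hypothetical, and how the pipeline actually passes its walltime arg is not shown here.

```python
def slurm_walltime(hours):
    """Format a duration in hours as a SLURM 'HH:MM:SS' time limit."""
    total_seconds = round(hours * 3600)
    h, rem = divmod(total_seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


# The three limits mentioned in this thread: 4, 10 and 24 hours.
limits = [slurm_walltime(h) for h in (4, 10, 24)]
```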
Seems fixed: https://prefect1.hedwig-workflow-api.niaidqa.net/default/flow-run/54f86d6a-d620-40dd-b90f-6e039f02c844 although a shorter runtime was noted.
See log: https://gist.github.com/philipmac/8ecf6b4bf4acb64de0dbf9c117250485
Pipeline run: https://prefect1.hedwig-workflow-api.niaidqa.net/default/flow-run/6fb29e11-3ff0-44e9-a592-9079d824f863?schematic