At end of issue #2693 @effigies noted that the error that @dalejn was experiencing was due to the SLURM master throwing an error when it was polled with squeue, possibly because it was busy. After some further testing, we now believe that the NIH HPC SLURM master will throw this error at least once a day even with a modest polling interval.
We would like to request a patch such that if NiPype receives any kind of timeout error (we've seen a few different kinds) from squeue, that it politely waits and tries again.
Actual behavior
RuntimeError: Command:
squeue -j 9448406
Standard output:
Standard error:
slurm_load_jobs error: Socket timed out on send/recv operation
Return code: 1
Summary
At end of issue #2693 @effigies noted that the error that @dalejn was experiencing was due to the SLURM master throwing an error when it was polled with squeue, possibly because it was busy. After some further testing, we now believe that the NIH HPC SLURM master will throw this error at least once a day even with a modest polling interval.
We would like to request a patch such that if NiPype receives any kind of timeout error (we've seen a few different kinds) from squeue, that it politely waits and tries again.
Actual behavior
or
and NiPype exits
Requested behavior
And NiPype does _not_exit
Platform details: