nipy / nipype

Workflows and interfaces for neuroimaging packages
https://nipype.readthedocs.org/en/latest/

Req to deal with SLURM socket errors more patiently #2766

Closed by agt24 6 years ago

agt24 commented 6 years ago

Summary

At the end of issue #2693, @effigies noted that the error @dalejn was experiencing was caused by the SLURM master throwing an error when it was polled with squeue, possibly because it was busy. After some further testing, we now believe that the NIH HPC SLURM master will throw this error at least once a day, even with a modest polling interval.

We would like to request a patch so that, if NiPype receives any kind of timeout error from squeue (we've seen a few different kinds), it politely waits and tries again.

Actual behavior

RuntimeError: Command:
squeue -j 9448406
Standard output:

Standard error:
slurm_load_jobs error: Socket timed out on send/recv operation
Return code: 1

or

The batch system is not available at the moment.

and NiPype exits

Requested behavior

squeue is busy, will try again

And NiPype does _not_ exit
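
For illustration only, the requested behavior could look roughly like the sketch below. This is not NiPype's actual SLURM plugin code: the names `poll_squeue`, `is_transient` and `TRANSIENT_ERRORS`, the retry count and the delay are all hypothetical, and the list of "transient" messages is assumed from the two errors quoted above, so it is likely incomplete.

```python
import subprocess
import time

# Substrings taken from the error messages reported in this issue.
# Assumption: treating exactly these as "transient" is a guess, not SLURM policy.
TRANSIENT_ERRORS = (
    "Socket timed out on send/recv operation",
    "The batch system is not available at the moment",
)


def is_transient(stderr):
    """Return True if stderr looks like a temporary SLURM controller problem."""
    return any(msg in stderr for msg in TRANSIENT_ERRORS)


def poll_squeue(jobid, max_tries=5, delay=30.0,
                run=subprocess.run, sleep=time.sleep):
    """Run `squeue -j <jobid>`, retrying when the failure looks transient.

    `run` and `sleep` are injectable only so that the sketch can be
    exercised without a real SLURM installation.
    """
    cmd = ["squeue", "-j", str(jobid)]
    for attempt in range(1, max_tries + 1):
        proc = run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                   universal_newlines=True)
        if proc.returncode == 0:
            return proc.stdout
        # Give up on unknown errors, or once the retry budget is spent.
        if not is_transient(proc.stderr) or attempt == max_tries:
            raise RuntimeError(
                "Command:\n{}\nStandard error:\n{}\nReturn code: {}".format(
                    " ".join(cmd), proc.stderr, proc.returncode))
        print("squeue is busy, will try again")
        sleep(delay)
```

With these hypothetical defaults, `poll_squeue(9448406)` would wait 30 seconds between attempts and raise the same kind of `RuntimeError` only after five consecutive failures, or immediately on an error that is not in the transient list.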

Platform details:

(NiPypeUpdate) [zhoud4@felix ETPB]$ python -c "import nipype; from pprint import pprint; pprint(nipype.get_info())"
{'commit_hash': 'ec7457c23',
 'commit_source': 'installation',
 'networkx_version': '2.2',
 'nibabel_version': '2.3.1',
 'nipype_version': '1.1.3',
 'numpy_version': '1.15.3',
 'pkg_path': '/data/zhoud4/python/envs/NiPypeUpdate/lib/python3.5/site-packages/nipype',
 'scipy_version': '1.1.0',
 'sys_executable': '/data/zhoud4/python/envs/NiPypeUpdate/bin/python',
 'sys_platform': 'linux',
 'sys_version': '3.5.4 | packaged by conda-forge | (default, Aug 10 2017, '
                '01:38:41) \n'
                '[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]',
 'traits_version': '4.6.0'}
(NiPypeUpdate) [zhoud4@felix ETPB]$
(NiPypeUpdate) [zhoud4@biowulf ETPB]$ sinfo -V
slurm 17.02.9
(NiPypeUpdate) [zhoud4@biowulf ETPB]$

mgxd commented 6 years ago

this sounds reasonable - I'll take a look