nipy / nipype

Workflows and interfaces for neuroimaging packages
https://nipype.readthedocs.org/en/latest/

MapNode stuck when running on SLURM plugin. #3425

Open JRJacoby opened 2 years ago

JRJacoby commented 2 years ago

Summary

A MapNode gets stuck trying to submit a job to the cluster.

Actual behavior

I have a data loading function that passes lists into a MapNode. Each subnode that the MapNode creates runs successfully, but then the workflow never progresses to the next MapNode. Here's the last section of the debug log:

211215-15:45:34,207 nipype.workflow DEBUG: [Node] No hashfiles found in "/autofs/vast/citadel/studies/hcpa/users/john/analyses/12_14_2021_HCAP_WM_segmentations/outputs/nipype_base_dir/ATT_WM_segstats/vol2vol".
211215-15:45:34,213 nipype.workflow DEBUG: Checking hash "ATT_WM_segstats.vol2vol" locally: cached=False, updated=False.
211215-15:45:34,510 nipype.workflow DEBUG: Ran command (sbatch --account bandlab --partition basic --mem 4GB --time 48:00:00 -o /autofs/vast/citadel/studies/hcpa/users/john/analyses/12_14_2021_HCAP_WM_segmentations/outputs/nipype_base_dir/ATT_WM_segstats/batch/slurm-%j.out -e /autofs/vast/citadel/studies/hcpa/users/john/analyses/12_14_2021_HCAP_WM_segmentations/outputs/nipype_base_dir/ATT_WM_segstats/batch/slurm-%j.out -J vol2vol.ATT_WM_segstats.jj1006 /autofs/vast/citadel/studies/hcpa/users/john/analyses/12_14_2021_HCAP_WM_segmentations/outputs/nipype_base_dir/ATT_WM_segstats/batch/batchscript_pyscript_20211215_154534_ATT_WM_segstats_vol2vol.sh)
211215-15:45:34,517 nipype.workflow DEBUG: submitted sbatch task: 752954 for node vol2vol
211215-15:45:34,523 nipype.workflow INFO: Finished submitting: ATT_WM_segstats.vol2vol ID: 0
211215-15:45:34,530 nipype.workflow DEBUG: Slots available: None
211215-15:45:34,539 nipype.workflow DEBUG: Progress: 681 jobs, 678/1/0 (done/running/ready), 1/2 (pendingtasks/waiting).
211215-15:45:34,591 nipype.interface DEBUG: args -j 752954
211215-15:45:34,725 nipype.workflow DEBUG: Tasks currently running: 1. Pending: 1.
211215-15:45:34,738 nipype.workflow DEBUG: Slots available: None
211215-15:45:36,585 nipype.interface DEBUG: args -j 752954
211215-15:45:36,725 nipype.workflow DEBUG: Slots available: None
211215-15:45:38,588 nipype.interface DEBUG: args -j 752954
211215-15:45:38,726 nipype.workflow DEBUG: Slots available: None
211215-15:45:40,594 nipype.interface DEBUG: args -j 752954
211215-15:45:40,733 nipype.workflow DEBUG: Slots available: None

And it just continues like that indefinitely.
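For context, the repeating `args -j 752954` / `Slots available: None` pair looks like the plugin polling the scheduler every couple of seconds for the state of job 752954. A minimal sketch of what such a poll loop does is below; the function name and arguments are illustrative, not nipype's actual SLURM plugin API:

```python
import time

def wait_for_job(is_pending, poll_interval=0.01, timeout=1.0):
    """Poll a scheduler callback until the job stops being pending.

    Illustrative sketch only -- nipype's real plugin shells out to the
    scheduler (e.g. via sbatch/squeue) instead of calling a function.
    """
    waited = 0.0
    while is_pending():
        if waited >= timeout:
            raise TimeoutError("job still pending after timeout")
        time.sleep(poll_interval)  # one sleep per poll, like the repeated log lines
        waited += poll_interval
    return waited

# Simulate a job that stays pending for three polls, then finishes.
remaining = [True, True, True, False]
elapsed = wait_for_job(lambda: remaining.pop(0))
```

The point is that the loop emits one poll per interval with no upper bound by default, which is exactly the pattern in the log: the workflow is not wedged, it is waiting on whatever the scheduler (or a long-running node) is doing.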

Expected behavior

For the whole workflow to run all the way through.

How to replicate the behavior

I'm not exactly sure what's causing it; any MapNode run with the SLURM plugin seems to trigger it. My entire script is pasted below.

Script/Workflow details

Here's the whole code:

# IMPORTS
import os
import re
from nipype.interfaces.freesurfer.preprocess import ApplyVolTransform, BBRegister
from nipype.interfaces.freesurfer.model import SegStats
import pandas as pd
import numpy as np
import plotnine as p9
from get_data import get_dataframe
from nipype import MapNode, Workflow, Node, Function, config, logging
from nipype_interfaces import AsegStats2Table

# CONSTANT DEFINITIONS

# FUNCTIONS DEFINITIONS

def load_data():
   df = get_dataframe('HCAP', 'scan', ['ATT', 'control', 'recon_path', 'recon', 'asl_to_structural'])
   df = df.replace('', np.nan).dropna()

   df['wm_mask'] = df['recon_path'] + 'mri/wm.mgz'
   df['wm_seg'] = df['recon_path'] + 'mri/wmparc.mgz'
   df['ATT_segstats_output'] = df['recon_path'] + 'stats/att.wmparc.sum'
   df['ATT_volume_anat_space'] = df['recon_path'] + 'mri/ATT.nii'

   # df = df.loc[[0, 1], :]

   return df.to_dict(orient='list')

# NODE DEFINITIONS

data = load_data()

vol2vol = MapNode(
   ApplyVolTransform(
      fs_target=True,
      subjects_dir='/autofs/vast/citadel/studies/hcpa/data/recons',
      ignore_exception=True
   ),
   name='vol2vol',
   iterfield=['source_file', 'reg_file', 'transformed_file'],
)
vol2vol.inputs.source_file = data['ATT']
vol2vol.inputs.transformed_file = data['ATT_volume_anat_space']
vol2vol.inputs.reg_file = data['asl_to_structural']

segstats = MapNode(
   SegStats(
      mask_erode=1,
      default_color_table=True,
      ignore_exception=True
   ),
   name='segstats',
   iterfield=['mask_file', 'in_file', 'segmentation_file', 'summary_file'],
)
segstats.inputs.segmentation_file = data['wm_seg']
segstats.inputs.mask_file = data['wm_mask']
segstats.inputs.summary_file = data['ATT_segstats_output']

asegstats = Node(
   AsegStats2Table(
      subjects_dir='/autofs/vast/citadel/studies/hcpa/data/recons',
      subject_ids=data['recon'],
      out_file='/autofs/vast/citadel/studies/hcpa/users/john/analyses/12_14_2021_HCAP_WM_segmentations/outputs/att.asegstats2table.txt',
      subdir='stats',
      stats='att.wmparc.sum'
   ),
   name='asegstats'
)

# WORKFLOW DEFINITIONS

wf = Workflow(
   name='ATT_WM_segstats',
   base_dir='/autofs/vast/citadel/studies/hcpa/users/john/analyses/12_14_2021_HCAP_WM_segmentations/outputs/nipype_base_dir'
)
wf.connect(
   [
      (vol2vol, segstats, [('transformed_file', 'in_file')]),
      (segstats, asegstats, [('summary_file', 'null')])
   ]
)
config.update_config({'logging': {'log_directory': os.path.join(os.getcwd(), 'resources/logs'),
                                  'log_to_file': True,
                                  'workflow_level': 'DEBUG',
                                  'interface_level': 'DEBUG'},
                        'execution': {'hash_method': 'content'}})
logging.update_logging(config)
wf.run(
   plugin='SLURM',
   plugin_args={'sbatch_args': '--account bandlab --partition basic --mem 4GB --time 48:00:00'}
)

# wf.run()
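One thing worth ruling out when a MapNode stalls on expansion: every list passed to an `iterfield` must have the same length, or the subnode expansion misbehaves. A small hypothetical helper (the names here are mine, not nipype's) that sanity-checks the dict-of-lists returned by `load_data` before wiring it into the MapNodes:

```python
def check_iterfields(data, fields):
    """Raise if the given iterfield lists are missing or unequal in length.

    Illustrative helper, not part of nipype -- `data` is assumed to be a
    dict of lists like the one load_data() returns.
    """
    lengths = {f: len(data[f]) for f in fields}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"iterfield lists differ in length: {lengths}")
    return next(iter(lengths.values()))

# Toy dict shaped like load_data()'s output.
toy = {"ATT": ["a.nii", "b.nii"], "asl_to_structural": ["a.dat", "b.dat"]}
n_subnodes = check_iterfields(toy, ["ATT", "asl_to_structural"])
```

Calling this on the real `data` dict with each MapNode's iterfield names would confirm the expansion count matches the number of scans.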

Platform details:

{'commit_hash': 'b385720',
 'commit_source': 'installation',
 'networkx_version': '2.5.1',
 'nibabel_version': '3.2.1',
 'nipype_version': '1.7.0',
 'numpy_version': '1.19.5',
 'pkg_path': '/autofs/space/nihilus_001/users/john/analyses/pyenv/lib64/python3.6/site-packages/nipype',
 'scipy_version': '1.5.4',
 'sys_executable': '/autofs/space/nihilus_001/users/john/analyses/pyenv/bin/python',
 'sys_platform': 'linux',
 'sys_version': '3.6.8 (default, May  8 2021, 09:11:34) \n'
                '[GCC 8.4.1 20210423 (Red Hat 8.4.1-2)]',
 'traits_version': '6.3.2'}


JRJacoby commented 2 years ago

So eventually it continued with:

211215-15:57:08,660 nipype.workflow DEBUG:
     adding multipath trait: segmentation_file
211215-15:57:08,667 nipype.workflow DEBUG:
     adding multipath trait: summary_file
211215-15:57:10,591 nipype.workflow DEBUG:
     [Node] Setting 1 connected inputs of node "segstats" from 1 previous nodes.
211215-15:57:10,673 nipype.workflow DEBUG:
     Outputs object of loaded result /autofs/vast/citadel/studies/hcpa/users/john/analyses/12_14_2021_HCAP_WM_segmentations/outputs/nipype_base_dir/ATT_WM_segstats/vol2vol/result_vol2vol.pklz is a Bunch.
211215-15:57:10,685 nipype.workflow DEBUG:
     output: transformed_file
211215-15:57:10,695 nipype.workflow DEBUG:

So it looks like it was in fact doing something that whole time. Does anyone know what? The next MapNode (segstats) is now stuck in the same way: all the subnodes ran, and the main segstats node itself is now running as a job. What is that job doing? It's been running much longer than a segstats command should take, all while printing that same "Slots available: None" message.
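One possible explanation for the roughly 12-minute silence (15:45 to 15:57): the script sets `'hash_method': 'content'`, and content hashing reads every byte of every input file when nipype checks node caches, which can be very slow over a network filesystem with hundreds of images. Timestamp hashing only stats the files. The toy comparison below is pure Python to illustrate the cost difference, not nipype's internal hashing code:

```python
import hashlib
import os
import tempfile

def content_hash(path):
    """Hash the full file contents -- cost grows with total data size."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def timestamp_hash(path):
    """Hash only size and mtime -- constant cost per file, no reads."""
    st = os.stat(path)
    return hashlib.md5(f"{st.st_size}:{st.st_mtime}".encode()).hexdigest()

# 1 MB stand-in for a large image volume.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 1024 * 1024)
    path = f.name
h_content = content_hash(path)   # reads the whole file
h_stamp = timestamp_hash(path)   # one stat call
os.unlink(path)
```

If the hashing cost is the culprit, switching the execution config to `'hash_method': 'timestamp'` should shrink those quiet gaps, at the price of missing changes that preserve size and mtime.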