nipy / nipype

Workflows and interfaces for neuroimaging packages
https://nipype.readthedocs.org/en/latest/

fMRIprep / nipype error while checking node hash #3009

Closed jeroenvanbaar closed 5 years ago

jeroenvanbaar commented 5 years ago

Summary

When running fmriprep 1.5.0rc2 in a Singularity container, I get the following nipype.workflow warning, which asks me to open an issue here. I don't get this error when I run fmriprep 1.4.1rc1.

The warning/error:

190828-18:05:54,262 nipype.workflow WARNING:
     Error while checking node hash, forcing re-run. Although this error may not prevent the workflow from running, it could indicate a major problem. Please report a new issue at https://github.com/nipy/nipype/issues adding the following information:

    Node: fmriprep_wf.single_subject_001_wf.func_preproc_ses_1_task_wordReading_run_4_wf.bold_reference_wf.enhance_and_skullstrip_bold_wf.map_brainmask
    Interface: niworkflows.interfaces.fixes.FixHeaderApplyTransforms
    Traceback:
Traceback (most recent call last):

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/base.py", line 337, in _local_hash_check
    cached, updated = self.procs[jobid].is_cached()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 297, in is_cached
    hashed_inputs, hashvalue = self._get_hashval()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 494, in _get_hashval
    self._get_inputs()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
    outputs = _load_resultfile(results_file)[0].outputs

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 300, in load_resultfile
    result = loadpkl(results_file)

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/utils/filemanip.py", line 686, in loadpkl
    pkl_file = pklopen(infile.name, 'rb')

  File "/usr/local/miniconda/lib/python3.7/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)

  File "/usr/local/miniconda/lib/python3.7/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')

FileNotFoundError: [Errno 2] No such file or directory: 'result_norm.pklz'
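
For reference, the failing call boils down to reading a node's result_*.pklz file, which is a gzipped pickle. A simplified sketch of that load path (an approximation of nipype.utils.filemanip.loadpkl, not its exact code):

import gzip
import pickle

def load_result_sketch(results_file):
    # gzip.open ultimately calls builtins.open, so a missing file raises
    # the FileNotFoundError shown in the traceback above.
    with gzip.open(results_file, 'rb') as pkl_file:
        return pickle.load(pkl_file)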

Script/Workflow details

My Singularity command:

singularity run --cleanenv -B /gpfs_home/jvanbaar/data/jvanbaar/my_project_folder:/p1,/gpfs/scratch/jvanbaar:/s1 fmriprep-1.5.0rc2.simg /p1/sourcedata /p1/derivatives participant -w /s1 --participant-label sub-001 --fs-license-file /p1/freesurfer.txt --fs-no-reconall --output-spaces MNI152NLin2009cAsym --resource-monitor --write-graph --use-syn-sdc --ignore fieldmaps --n_cpus 4 --mem_mb 32000

Happy to provide more information.

oesteban commented 5 years ago

You seem to be reusing /gpfs/scratch/jvanbaar as your work directory. Can you test whether the following command works for you?

singularity run --cleanenv -B /gpfs_home/jvanbaar/data/jvanbaar/my_project_folder:/p1,/gpfs/scratch/jvanbaar:/s1 fmriprep-1.5.0rc2.simg /p1/sourcedata /p1/derivatives participant -w /s1/work-001 --participant-label sub-001 --fs-license-file /p1/freesurfer.txt --fs-no-reconall --output-spaces MNI152NLin2009cAsym --resource-monitor --write-graph --use-syn-sdc --ignore fieldmaps --n_cpus 4 --mem_mb 32000
jeroenvanbaar commented 5 years ago

Thanks for the quick response! I'm now using /s1/sub-001 as work directory, but am still getting the same error. Different traceback though:

190828-19:03:41,432 nipype.workflow WARNING:
     Error while checking node hash, forcing re-run. Although this error may not prevent the workflow from running, it could indicate a major problem. Please report a new issue at https://github.com/nipy/nipype/issues adding the following information:

    Node: fmriprep_wf.single_subject_001_wf.func_preproc_ses_1_task_wordReading_run_1_wf.bold_reference_wf.gen_ref
    Interface: niworkflows.interfaces.registration.EstimateReferenceImage
    Traceback:
Traceback (most recent call last):

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/base.py", line 337, in _local_hash_check
    cached, updated = self.procs[jobid].is_cached()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 297, in is_cached
    hashed_inputs, hashvalue = self._get_hashval()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 494, in _get_hashval
    self._get_inputs()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
    outputs = _load_resultfile(results_file)[0].outputs

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 311, in load_resultfile
    if resolve and result.outputs:

AttributeError: 'NoneType' object has no attribute 'outputs'
oesteban commented 5 years ago

Okay, this is not the same error. Are you trying to parallelize several subjects by running several fmriprep processes separately?

jeroenvanbaar commented 5 years ago

Yes, this happens when running 10 separate fmriprep processes in parallel on separate HPC nodes (10 subjects). I defined a different scratch folder for each process.

oesteban commented 5 years ago

Can you post your submission script?

jeroenvanbaar commented 5 years ago

Here's my slurm submission script:

#!/bin/bash

#SBATCH -J fMRIprep

#SBATCH -c 4
#SBATCH --mem=32G
#SBATCH --account=my_account_name
#SBATCH -t 24:00:00

#SBATCH --array=1,2,4,5,6,7,8,10,11,13
#SBATCH -e fmriprep_logs/fmriprep_sub-%a.err
#SBATCH -o fmriprep_logs/fmriprep_sub-%a.out

#SBATCH --mail-type=ALL
#SBATCH --mail-user=my_email_address

echo "Starting fMRIprep for subject "$SLURM_ARRAY_TASK_ID

subject_dir=$(printf "sub-%03d" $SLURM_ARRAY_TASK_ID)
scratch_dir=$(printf "/s1/sub-%03d" $SLURM_ARRAY_TASK_ID)

echo "Scratch directory: "$scratch_dir

singularity run --cleanenv -B /gpfs_home/jvanbaar/data/jvanbaar/my_project_folder:/p1,/gpfs/scratch/jvanbaar:/s1 fmriprep-1.5.0rc2.simg /p1/sourcedata /p1/derivatives participant -w $scratch_dir --participant-label $subject_dir --fs-license-file /p1/freesurfer.txt --fs-no-reconall --output-spaces MNI152NLin2009cAsym --resource-monitor --write-graph --use-syn-sdc --ignore fieldmaps --n_cpus 4 --mem_mb 32000
mattcieslak commented 5 years ago

I am also getting this error when running a workflow in 1.2.1 that has MapNodes and uses multiprocessing. It happens about 40% of the time I run a workflow, and it occurs randomly; if I re-run the workflow multiple times, it will eventually succeed. So it's not specific to fmriprep; I think it's nipype 1.2.1.

Edit: I'm getting this error both with Singularity 3.3.0 and with Docker.

oesteban commented 5 years ago

@mattcieslak could you write a failing example for Nipype that we can use to debug this?
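
Something along these lines could be a starting point (a sketch only: the workflow, functions, and paths here are made up, and it is not confirmed to trigger the failure):

# Sketch of a possible stress test: a MapNode workflow run with MultiProc.
# All names and paths are hypothetical; this is not a confirmed repro.
import nipype.pipeline.engine as pe
from nipype.interfaces.utility import Function

def make_list(n):
    return list(range(n))

def add_one(x):
    return x + 1

source = pe.Node(Function(input_names=['n'], output_names=['out'],
                          function=make_list), name='make_list')
source.inputs.n = 20

adder = pe.MapNode(Function(input_names=['x'], output_names=['out'],
                            function=add_one),
                   iterfield=['x'], name='add_one')

wf = pe.Workflow(name='repro_wf', base_dir='/tmp/repro')
wf.connect(source, 'out', adder, 'x')

# Re-running with the same base_dir exercises the cached-results path
# where the hash-check warning is raised.
wf.run(plugin='MultiProc', plugin_args={'n_procs': 4})
wf.run(plugin='MultiProc', plugin_args={'n_procs': 4})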

oesteban commented 5 years ago

Okay, I can see this is currently happening in fMRIPrep's master; I'll take this one.

oesteban commented 5 years ago

@jeroenvanbaar can you provide full logs from fMRIPrep?

@mattcieslak I've opened the issue referenced above for the case of MapNodes.

oesteban commented 5 years ago

It seems you can ignore this warning, as it does not appear to stop the workflow:

190828-19:03:41,432 nipype.workflow WARNING:
     Error while checking node hash, forcing re-run. Although this error may not prevent the workflow from running, it could indicate a major problem. Please report a new issue at https://github.com/nipy/nipype/issues adding the following information:

    Node: fmriprep_wf.single_subject_001_wf.func_preproc_ses_1_task_wordReading_run_1_wf.bold_reference_wf.gen_ref
    Interface: niworkflows.interfaces.registration.EstimateReferenceImage
    Traceback:
Traceback (most recent call last):

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/base.py", line 337, in _local_hash_check
    cached, updated = self.procs[jobid].is_cached()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 297, in is_cached
    hashed_inputs, hashvalue = self._get_hashval()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 494, in _get_hashval
    self._get_inputs()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
    outputs = _load_resultfile(results_file)[0].outputs

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 311, in load_resultfile
    if resolve and result.outputs:

AttributeError: 'NoneType' object has no attribute 'outputs'

190828-19:03:41,434 nipype.workflow INFO:
     [Node] Setting-up "fmriprep_wf.single_subject_001_wf.func_preproc_ses_1_task_wordReading_run_1_wf.bold_reference_wf.gen_ref" in "/s1/sub-001/fmriprep_wf/single_subject_001_wf/func_preproc_ses_1_task_wordReading_run_1_wf/bold_reference_wf/gen_ref".
190828-19:03:41,442 nipype.workflow INFO:
     [Node] Cached "fmriprep_wf.single_subject_001_wf.func_preproc_ses_1_task_wordReading_run_1_wf.bold_reference_wf.gen_ref" - collecting precomputed outputs
190828-19:03:41,442 nipype.workflow INFO:
     [Node] "fmriprep_wf.single_subject_001_wf.func_preproc_ses_1_task_wordReading_run_1_wf.bold_reference_wf.gen_ref" found cached.

Nonetheless, we'll try to find out why this warning is being raised so often.

oesteban commented 5 years ago

Another one. This time it happened after the interface was run:

190910-02:31:11,300 nipype.utils ERROR:
     No metadata was found in the pkl file. Make sure you are currently using the same Nipype version from the generated pkl.
190910-02:31:11,346 nipype.workflow ERROR:
     Node add_dvars_header failed to run on host 1064e7e45ec5.
190910-02:31:11,394 nipype.workflow ERROR:
     Saving crash info to /out/fmriprep/sub-01/log/20190910-021645_86b7c962-680e-4c02-a110-e8aef9529849/crash-20190910-023111-UID1001-add_dvars_header-8b4860c1-2a0c-4a70-a879-961f5dbdc4d8.txt
Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/legacymultiproc.py", line 381, in _send_procs_to_workers
    self.procs[jobid].run(updatehash=updatehash)
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 488, in run
    self, report_type='postexec', is_mapnode=isinstance(self, MapNode))
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 152, in write_report
    result = node.result  # Locally cache result
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 198, in result
    op.join(self.output_dir(), 'result_%s.pklz' % self.name))[0]
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 311, in load_resultfile
    if resolve and result.outputs:
AttributeError: 'NoneType' object has no attribute 'outputs'

190910-02:31:11,473 nipype.workflow INFO:
     [Job 238] Completed (fmriprep_wf.single_subject_01_wf.func_preproc_task_mixedgamblestask_run_02_wf.bold_confounds_wf.add_dvars_header).

EDIT: Added the link, and checked that just rerunning the build (i.e., without clearing the cache) worked. This might be some sort of synchronization issue.

satra commented 5 years ago

@oesteban - something seems to have fundamentally changed for these errors to start popping up, which makes me worry a bit.

Can we run it with the config option stop_on_first_crash = True, and also increase the timeout (although normally that should not affect things locally)?

Also, the call to result.outputs happens after this check:

    if not results_file.exists():
        return result, aggregate, attribute_error

This also suggests a write-timing/results-availability issue, i.e. the plugin has returned control without finishing writing a results file. I didn't think that could happen for MultiProc, but now that we are using concurrent.futures, this can easily happen, especially on a system like CircleCI, where execution speeds can be limited.
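
In other words, the suspected sequence looks roughly like this (a hypothetical illustration, not nipype's actual code):

# Hypothetical illustration of the suspected race: the existence check
# passes while another process is still writing the file, so the load
# sees a truncated or empty pickle.
from pathlib import Path
import gzip
import pickle

def check_then_load(results_file):
    if not Path(results_file).exists():
        return None                       # the guard quoted above
    # <-- race window: the writer may not have flushed the file yet
    with gzip.open(results_file, 'rb') as f:
        return pickle.load(f)             # may fail or yield incomplete data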

How about we try out SoftFileLock from filelock (which we are already using in pydra) when saving and loading result files? If this works, it may help in a few areas.
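
A rough sketch of what that could look like (assuming the filelock package is available; the '.lock' suffix naming is an assumption):

# Guard result-file I/O with SoftFileLock from the filelock package,
# so a reader never sees a half-written gzipped pickle.
import gzip
import pickle
from filelock import SoftFileLock

def save_resultfile(path, result):
    with SoftFileLock(str(path) + '.lock', timeout=60):
        with gzip.open(path, 'wb') as f:
            pickle.dump(result, f)

def load_resultfile(path):
    with SoftFileLock(str(path) + '.lock', timeout=60):
        with gzip.open(path, 'rb') as f:
            return pickle.load(f)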

oesteban commented 5 years ago

> Something seems to have fundamentally changed for these errors to start popping up, which makes me worry a bit.

We're on the same page :(

> Can we run it with the config option stop_on_first_crash = True, and also increase the timeout (although normally that should not affect things locally)?

Yes, I think setting stop_on_first_crash = True is a good idea for fMRIPrep anyway.

> This also suggests a write-timing/results-availability issue, i.e. the plugin has returned control without finishing writing a results file. I didn't think that could happen for MultiProc, but now that we are using concurrent.futures, this can easily happen, especially on a system like CircleCI, where execution speeds can be limited.

This happened with LegacyMultiProc, so there must be something else. But yes, I've been looking into this, and I can only think of two possibilities: write-timing/results availability, or the results object being mutated elsewhere (which seems impossible, by the way).

> How about we try out SoftFileLock from filelock (which we are already using in pydra) when saving and loading result files? If this works, it may help in a few areas.

Sounds good, of course.

oesteban commented 5 years ago

This has been addressed with #3024. I'll leave #3014 open as the changes we've been including might not have solved that one.

oesteban commented 4 years ago

Interestingly, this error (warning) on regular nodes seems to be really short-lived:

191010-06:42:30,318 nipype.utils ERROR:
         No metadata was found in the pkl file. Make sure you are currently using the same Nipype version from the generated pkl.
191010-06:42:30,324 nipype.workflow WARNING:
         Error while checking node hash, forcing re-run. Although this error may not prevent the workflow from running, it could indicate a major problem. Please report a new issue
at https://github.com/nipy/nipype/issues adding the following information:

        Node: fmriprep_wf.single_subject_NDARINV00CY2MDM_wf.func_preproc_ses_baselineYear1Arm1_task_rest_run_04_wf.bold_reg_wf.bbreg_wf.bbregister
        Interface: niworkflows.interfaces.freesurfer.PatchedBBRegisterRPT
        Traceback:
Traceback (most recent call last):

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/base.py", line 337, in _local_hash_check
    cached, updated = self.procs[jobid].is_cached()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 297, in is_cached
    hashed_inputs, hashvalue = self._get_hashval()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 494, in _get_hashval
    self._get_inputs()

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
    outputs = _load_resultfile(results_file)[0].outputs

  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 311, in load_resultfile
    if resolve and result.outputs:

AttributeError: 'NoneType' object has no attribute 'outputs'

191010-06:42:30,325 nipype.workflow INFO:
         [Node] Setting-up "fmriprep_wf.single_subject_NDARINV00CY2MDM_wf.func_preproc_ses_baselineYear1Arm1_task_rest_run_04_wf.bold_reg_wf.bbreg_wf.bbregister" in "/scratch/03763/oesteban/fmriprep-work/sub-NDARINV00CY2MDM/fmriprep_wf/single_subject_NDARINV00CY2MDM_wf/func_preproc_ses_baselineYear1Arm1_task_rest_run_04_wf/bold_reg_wf/bbreg_wf/bbregister".
191010-06:42:30,353 nipype.workflow INFO:
         [Job 615] Cached (fmriprep_wf.single_subject_NDARINV00CY2MDM_wf.func_preproc_ses_baselineYear1Arm1_task_rest_run_03_wf.bold_reg_wf.bbreg_wf.mri_coreg).
191010-06:42:30,378 nipype.workflow INFO:
         [Node] Cached "fmriprep_wf.single_subject_NDARINV00CY2MDM_wf.func_preproc_ses_baselineYear1Arm1_task_rest_run_04_wf.bold_reg_wf.bbreg_wf.bbregister" - collecting precomputed outputs
191010-06:42:30,378 nipype.workflow INFO:
         [Node] "fmriprep_wf.single_subject_NDARINV00CY2MDM_wf.func_preproc_ses_baselineYear1Arm1_task_rest_run_04_wf.bold_reg_wf.bbreg_wf.bbregister" found cached.

When the node is checked by the execution plugin, the warning is issued while trying to read the outputs of a prior node feeding into the bbregister node. A few seconds later, the results files for the inputs are checked again, and this time the load succeeds: the node is found cached.
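
Given that the files become readable moments later, a short retry loop around the load would be one mitigation (a hypothetical sketch, not necessarily what nipype does):

# Hypothetical mitigation: retry reading a result file briefly before
# forcing a re-run, since the file often appears a few seconds later.
import gzip
import pickle
import time

def load_with_retry(results_file, attempts=5, delay=2.0):
    for _ in range(attempts):
        try:
            with gzip.open(results_file, 'rb') as f:
                return pickle.load(f)
        except (OSError, EOFError):
            # missing, truncated, or partially written file:
            # give the writer time to finish flushing, then retry
            time.sleep(delay)
    return None  # caller falls back to re-running the node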