You seem to be reusing /gpfs/scratch/jvanbaar as your work directory. Can you test whether the following command works for you:
singularity run --cleanenv \
    -B /gpfs_home/jvanbaar/data/jvanbaar/my_project_folder:/p1,/gpfs/scratch/jvanbaar:/s1 \
    fmriprep-1.5.0rc2.simg \
    /p1/sourcedata /p1/derivatives participant \
    -w /s1/work-001 --participant-label sub-001 \
    --fs-license-file /p1/freesurfer.txt --fs-no-reconall \
    --output-spaces MNI152NLin2009cAsym \
    --resource-monitor --write-graph --use-syn-sdc --ignore fieldmaps \
    --n_cpus 4 --mem_mb 32000
Thanks for the quick response! I'm now using /s1/sub-001 as the work directory, but I am still getting the same error, though with a different traceback:
190828-19:03:41,432 nipype.workflow WARNING:
Error while checking node hash, forcing re-run. Although this error may not prevent the workflow from running, it could indicate a major problem. Please report a new issue at https://github.com/nipy/nipype/issues adding the following information:
Node: fmriprep_wf.single_subject_001_wf.func_preproc_ses_1_task_wordReading_run_1_wf.bold_reference_wf.gen_ref
Interface: niworkflows.interfaces.registration.EstimateReferenceImage
Traceback:
Traceback (most recent call last):
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/base.py", line 337, in _local_hash_check
cached, updated = self.procs[jobid].is_cached()
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 297, in is_cached
hashed_inputs, hashvalue = self._get_hashval()
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 494, in _get_hashval
self._get_inputs()
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
outputs = _load_resultfile(results_file)[0].outputs
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 311, in load_resultfile
if resolve and result.outputs:
AttributeError: 'NoneType' object has no attribute 'outputs'
Okay, this is not the same error. Are you trying to parallelize across subjects by running several fMRIPrep processes separately?
Yes, this happens when running 10 separate fMRIPrep processes in parallel on separate HPC nodes (10 subjects). I defined a different scratch folder for each process.
Can you post your submission script?
Here's my slurm submission script:
#!/bin/bash
#SBATCH -J fMRIprep
#SBATCH -c 4
#SBATCH --mem=32G
#SBATCH --account=my_account_name
#SBATCH -t 24:00:00
#SBATCH --array=1,2,4,5,6,7,8,10,11,13
#SBATCH -e fmriprep_logs/fmriprep_sub-%a.err
#SBATCH -o fmriprep_logs/fmriprep_sub-%a.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my_email_address
echo "Starting fMRIprep for subject "$SLURM_ARRAY_TASK_ID
subject_dir=$(printf "sub-%03d" $SLURM_ARRAY_TASK_ID)
scratch_dir=$(printf "/s1/sub-%03d" $SLURM_ARRAY_TASK_ID)
echo "Scratch directory: "$scratch_dir
singularity run --cleanenv -B /gpfs_home/jvanbaar/data/jvanbaar/my_project_folder:/p1,/gpfs/scratch/jvanbaar:/s1 fmriprep-1.5.0rc2.simg /p1/sourcedata /p1/derivatives participant -w $scratch_dir --participant-label $subject_dir --fs-license-file /p1/freesurfer.txt --fs-no-reconall --output-spaces MNI152NLin2009cAsym --resource-monitor --write-graph --use-syn-sdc --ignore fieldmaps --n_cpus 4 --mem_mb 32000
I am also getting this error whenever running a workflow in 1.2.1 that has MapNodes and uses multiprocessing. It happens about 40% of the time I run a workflow and occurs randomly; if I re-run the workflow multiple times it will eventually succeed. So it's not specific to fMRIPrep - I think it's nipype 1.2.1.
Edit: I'm getting this error when using both Singularity 3.3.0 and Docker.
@mattcieslak could you write a failing example for Nipype that we can use to debug this?
Okay, I can see this is currently happening in fMRIPrep's master - I'll take this one.
@jeroenvanbaar can you provide full logs from fMRIPrep?
@mattcieslak I've opened the issue referenced above for the case of MapNodes.
It seems you can ignore this warning, as it does not stop the workflow:
190828-19:03:41,432 nipype.workflow WARNING:
Error while checking node hash, forcing re-run. Although this error may not prevent the workflow from running, it could indicate a major problem. Please report a new issue at https://github.com/nipy/nipype/issues adding the following information:
Node: fmriprep_wf.single_subject_001_wf.func_preproc_ses_1_task_wordReading_run_1_wf.bold_reference_wf.gen_ref
Interface: niworkflows.interfaces.registration.EstimateReferenceImage
Traceback:
Traceback (most recent call last):
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/base.py", line 337, in _local_hash_check
cached, updated = self.procs[jobid].is_cached()
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 297, in is_cached
hashed_inputs, hashvalue = self._get_hashval()
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 494, in _get_hashval
self._get_inputs()
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
outputs = _load_resultfile(results_file)[0].outputs
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 311, in load_resultfile
if resolve and result.outputs:
AttributeError: 'NoneType' object has no attribute 'outputs'
190828-19:03:41,434 nipype.workflow INFO:
[Node] Setting-up "fmriprep_wf.single_subject_001_wf.func_preproc_ses_1_task_wordReading_run_1_wf.bold_reference_wf.gen_ref" in "/s1/sub-001/fmriprep_wf/single_subject_001_wf/func_preproc_ses_1_task_wordReading_run_1_wf/bold_reference_wf/gen_ref".
190828-19:03:41,442 nipype.workflow INFO:
[Node] Cached "fmriprep_wf.single_subject_001_wf.func_preproc_ses_1_task_wordReading_run_1_wf.bold_reference_wf.gen_ref" - collecting precomputed outputs
190828-19:03:41,442 nipype.workflow INFO:
[Node] "fmriprep_wf.single_subject_001_wf.func_preproc_ses_1_task_wordReading_run_1_wf.bold_reference_wf.gen_ref" found cached.
Nonetheless, we'll try to find out why this warning is being overused.
Another one. This time it happened after the interface was run:
190910-02:31:11,300 nipype.utils ERROR:
No metadata was found in the pkl file. Make sure you are currently using the same Nipype version from the generated pkl.
190910-02:31:11,346 nipype.workflow ERROR:
Node add_dvars_header failed to run on host 1064e7e45ec5.
190910-02:31:11,394 nipype.workflow ERROR:
Saving crash info to /out/fmriprep/sub-01/log/20190910-021645_86b7c962-680e-4c02-a110-e8aef9529849/crash-20190910-023111-UID1001-add_dvars_header-8b4860c1-2a0c-4a70-a879-961f5dbdc4d8.txt
Traceback (most recent call last):
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/legacymultiproc.py", line 381, in _send_procs_to_workers
self.procs[jobid].run(updatehash=updatehash)
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 488, in run
self, report_type='postexec', is_mapnode=isinstance(self, MapNode))
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 152, in write_report
result = node.result # Locally cache result
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 198, in result
op.join(self.output_dir(), 'result_%s.pklz' % self.name))[0]
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 311, in load_resultfile
if resolve and result.outputs:
AttributeError: 'NoneType' object has no attribute 'outputs'
190910-02:31:11,473 nipype.workflow INFO:
[Job 238] Completed (fmriprep_wf.single_subject_01_wf.func_preproc_task_mixedgamblestask_run_02_wf.bold_confounds_wf.add_dvars_header).
EDIT: Added the link, and checked that just rerunning the build (i.e. without clearing the cache) did work out - this might be some sort of synchronization issue.
@oesteban - something seems to have fundamentally changed for these errors to start popping up, which makes me worry a bit.
can we run it with the config options (stop_on_first_crash = true, and also an increased timeout - although normally that should not affect things locally)?
also, the call to result.outputs happens after this check:
if not results_file.exists():
    return result, aggregate, attribute_error
which also suggests a write-timing/results-availability issue, i.e. the plugin has returned control without finishing writing a results file. i didn't think that could happen for multiproc, but now that we are using concurrent.futures with a future, this can easily happen, especially on a system like circle, where execution speeds can be limited.
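A toy illustration of the suspected race (not nipype code): the results file exists on disk as soon as the writer opens it, so an exists() check can pass while the pickle payload is still being flushed, and a concurrent reader then loads an incomplete object. All names below are made up for the demo.

import pickle
import threading
import time
from pathlib import Path

results_file = Path('result_demo.pkl')

def slow_writer():
    # the file exists (and is empty) as soon as open() returns
    with open(results_file, 'wb') as f:
        time.sleep(0.5)  # simulate a slow flush on a busy filesystem
        pickle.dump({'outputs': 42}, f)

writer = threading.Thread(target=slow_writer)
writer.start()
time.sleep(0.1)  # a second process checks the cache mid-write
if results_file.exists():  # the exists() guard passes...
    try:
        with open(results_file, 'rb') as f:
            result = pickle.load(f)  # ...but the payload is not there yet
    except EOFError:
        print('race: file exists but the result is incomplete')
writer.join()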
how about we try out SoftFileLock from filelock (which we are using in pydra), when saving and loading result files? if this works, that may help a few areas.
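A minimal sketch of that idea, assuming the filelock package is available: guard both the write and the read of a node's result pickle with the same SoftFileLock, so a reader waits for an in-progress writer instead of seeing a half-written file. The helper names are hypothetical, not nipype's API.

import pickle
from filelock import SoftFileLock

def save_resultfile_locked(result, results_file, timeout=60):
    # SoftFileLock uses a sidecar lock file, so it also works on NFS/GPFS
    # mounts where flock-based locking is unreliable
    with SoftFileLock(str(results_file) + '.lock', timeout=timeout):
        with open(results_file, 'wb') as f:
            pickle.dump(result, f)

def load_resultfile_locked(results_file, timeout=60):
    # taking the same lock for reads means we block until any writer is
    # done, rather than loading a partially written pickle
    with SoftFileLock(str(results_file) + '.lock', timeout=timeout):
        with open(results_file, 'rb') as f:
            return pickle.load(f)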
something seems to have fundamentally changed for these errors to start popping up, which makes me worry a bit.
We're on the same page :(
can we run it with the config options (stop_on_first_crash = true, and also an increased timeout - although normally that should not affect things locally)?
Yes, I think setting stop_on_first_crash = true is a good idea for fMRIPrep anyway.
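For reference, stop_on_first_crash is a standard key in nipype's [execution] config section; a minimal way to enable it programmatically:

from nipype import config

# make the first node crash abort the whole workflow run; the same key
# can also be set in a nipype.cfg file under the [execution] section
config.update_config({'execution': {'stop_on_first_crash': True}})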
which also suggests a write-timing/results-availability issue, i.e. the plugin has returned control without finishing writing a results file. i didn't think that could happen for multiproc, but now that we are using concurrent.futures with a future, this can easily happen, especially on a system like circle, where execution speeds can be limited.
This happened with LegacyMultiProc, so there must be something else. But yes, I've been looking into this and I can only think of two possibilities: write-timing/results availability, or the results object getting mutated elsewhere (which seems impossible, btw).
how about we try out SoftFileLock from filelock (which we are using in pydra), when saving and loading result files? if this works, that may help a few areas.
Sounds good, of course.
This has been addressed with #3024. I'll leave #3014 open as the changes we've been including might not have solved that one.
Interestingly, this error (warning) on regular nodes seems to be really short-lived:
191010-06:42:30,318 nipype.utils ERROR:
No metadata was found in the pkl file. Make sure you are currently using the same Nipype version from the generated pkl.
191010-06:42:30,324 nipype.workflow WARNING:
Error while checking node hash, forcing re-run. Although this error may not prevent the workflow from running, it could indicate a major problem. Please report a new issue at https://github.com/nipy/nipype/issues adding the following information:
Node: fmriprep_wf.single_subject_NDARINV00CY2MDM_wf.func_preproc_ses_baselineYear1Arm1_task_rest_run_04_wf.bold_reg_wf.bbreg_wf.bbregister
Interface: niworkflows.interfaces.freesurfer.PatchedBBRegisterRPT
Traceback:
Traceback (most recent call last):
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/base.py", line 337, in _local_hash_check
cached, updated = self.procs[jobid].is_cached()
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 297, in is_cached
hashed_inputs, hashvalue = self._get_hashval()
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 494, in _get_hashval
self._get_inputs()
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
outputs = _load_resultfile(results_file)[0].outputs
File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 311, in load_resultfile
if resolve and result.outputs:
AttributeError: 'NoneType' object has no attribute 'outputs'
191010-06:42:30,325 nipype.workflow INFO:
[Node] Setting-up "fmriprep_wf.single_subject_NDARINV00CY2MDM_wf.func_preproc_ses_baselineYear1Arm1_task_rest_run_04_wf.bold_reg_wf.bbreg_wf.bbregister" in "/scratch/03763/oesteban/fmriprep-work/sub-NDARINV00CY2MDM/fmriprep_wf/single_subject_NDARINV00CY2MDM_wf/func_preproc_ses_baselineYear1Arm1_task_rest_run_04_wf/bold_reg_wf/bbreg_wf/bbregister".
191010-06:42:30,353 nipype.workflow INFO:
[Job 615] Cached (fmriprep_wf.single_subject_NDARINV00CY2MDM_wf.func_preproc_ses_baselineYear1Arm1_task_rest_run_03_wf.bold_reg_wf.bbreg_wf.mri_coreg).
191010-06:42:30,378 nipype.workflow INFO:
[Node] Cached "fmriprep_wf.single_subject_NDARINV00CY2MDM_wf.func_preproc_ses_baselineYear1Arm1_task_rest_run_04_wf.bold_reg_wf.bbreg_wf.bbregister" - collecting precomputed outputs
191010-06:42:30,378 nipype.workflow INFO:
[Node] "fmriprep_wf.single_subject_NDARINV00CY2MDM_wf.func_preproc_ses_baselineYear1Arm1_task_rest_run_04_wf.bold_reg_wf.bbreg_wf.bbregister" found cached.
When the node is checked by the execution plugin, the warning is issued while trying to read the outputs of a prior node feeding into the bbregister node. A few seconds later, the results files for the inputs are checked again and this time it works: the node is found cached.
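Given how short-lived the failure is, retrying the read with a brief backoff would likely mask it. The wrapper below is purely illustrative (not nipype code); loader stands in for whatever function reads the result pickle:

import time

def load_result_with_retry(loader, results_file, attempts=5, delay=2.0):
    # if the upstream writer is still flushing, the first read may yield an
    # incomplete result; waiting briefly usually succeeds, which matches the
    # "found cached" message a few seconds after the warning in the log
    for _ in range(attempts):
        result = loader(results_file)
        if result is not None and getattr(result, 'outputs', None) is not None:
            return result
        time.sleep(delay)
    raise IOError('could not read a complete result from %s' % results_file)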
Summary
When running fmriprep 1.5.0rc2 in a Singularity container, I get the following nipype.workflow warning, which asks for an issue to be opened here. I don't get this error when I run fmriprep 1.4.1rc1.
The warning/error:
Script/Workflow details
My Singularity command:
Happy to provide more information.