nipy / nipype

Workflows and interfaces for neuroimaging packages
https://nipype.readthedocs.org/en/latest/

Error checking MapNode hash with fMRIPrep #3014

Closed: oesteban closed this issue 4 years ago

oesteban commented 5 years ago

Summary

190904-01:20:18,695 nipype.workflow ERROR:
     Node t1_conform failed to run on host 890949213e2e.
190904-01:20:18,696 nipype.workflow ERROR:
     Saving crash info to /tmp/ds054/derivatives/smriprep/sub-100185/log/20190904-011955_7143d03c-6f71-442e-b03b-868465d5d7f5/crash-20190904-012018-UID1001-t1_conform-79d9866d-bc88-4848-9512-c54ff701db1b.txt
Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/plugins/multiproc.py", line 269, in _send_procs_to_workers
    num_subnodes = self.procs[jobid].num_subnodes()
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 1204, in num_subnodes
    self._get_inputs()
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 1219, in _get_inputs
    super(MapNode, self)._get_inputs()
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
    outputs = _load_resultfile(results_file)[0].outputs
  File "/usr/local/miniconda/lib/python3.7/site-packages/nipype/pipeline/engine/utils.py", line 311, in load_resultfile
    if resolve and result.outputs:
AttributeError: 'NoneType' object has no attribute 'outputs'

https://circleci.com/gh/oesteban/smriprep/502

Related to #3009 (in that case, it occurred for a regular Node).

oesteban commented 5 years ago

Cleaning up the cache solved the problem.

@satra - I believe this (and #3009) are artifacts of working-directory degradation, but I'm worried that they could have some other source, since they happen while _get_inputs is running. Gathering inputs from previous nodes should work, since their result files have already been checked (and should exist, as in the case of #3009). In the case of MapNodes, I think that the outputs of the prior node got aggregated for some obscure reason.

Before I go ahead and start a refactor that could end up being a rat's nest, I wanted to touch base with you. Do you want to allocate some time to chat (cc @effigies)?

satra commented 5 years ago

let's chat. i would like to understand whether this is a new bug introduced by some commit, or whether it comes from mixing cache versions.

effigies commented 5 years ago

I wonder if this is the same problem as #2694. IIRC, that was also impossible to reproduce because a clean working directory didn't exhibit the problem, and there was no clear path toward intentionally producing a broken one.

oesteban commented 5 years ago

I believe that last month's refactor should have addressed precisely that.

My impression is that nipype is being overly noisy: the first time the cache is checked in the master thread (early in that check there is a call to _get_inputs()), the result file of some preceding node is not found yet. On the second pass of the cache check (once the node has already been submitted to the executor, so the check runs in, e.g., the child process or a new cluster node), _get_inputs() does find the result file.

This synchronization issue worries me a little, as I would think the result file of preceding nodes should be available already in the first call to _get_inputs().
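
For context, the two passes look roughly like this (a simplified sketch based on my reading of the plugin code, not the actual implementation):

    # Simplified sketch (not the real nipype code) of the two cache passes:
    # the master thread pre-checks before submission, the worker re-checks
    # inside Node.run().
    def maybe_submit(node, executor):
        # Pass 1, master thread: is_cached() -> _get_inputs() -> hash check.
        # If an upstream result file is not visible yet, this is where the
        # "Error while checking node hash" warning gets emitted.
        cached, updated = node.is_cached()
        if cached and updated:
            return                      # reuse the cached result
        # Pass 2, worker process / cluster node: node.run() repeats the cache
        # check, and by that time the upstream result file is usually there.
        executor.submit(node.run)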

#2694 is related to the interpretation of the result file of the current node (as opposed to prior nodes, which is what happens here).

satra commented 5 years ago

@oesteban - this really depends on the file system and the executor.

  1. should never be an issue when using multiproc/concurrentfuture, since the filesystem is local and any filesystem cache does not have to be written out to a server. if the issue is happening with local executors, it likely has a different cause.

  2. this is often the case with job schedulers and NFS clients/servers, which is why nipype has a timeout configuration to explicitly deal with delayed dirty cache writes across compute nodes. a user can change this config option for such scenarios (see the example below).
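
for reference, a minimal way to bump that timeout from python (assuming the option meant here is execution/job_finished_timeout; 65 s is an arbitrary value):

    # example only: raise how long nipype waits for results files written by
    # other compute nodes before declaring them missing.
    from nipype import config

    config.set("execution", "job_finished_timeout", "65")  # seconds; default is 5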

oesteban commented 5 years ago

Right, that's why I believe this is mostly noise and things aren't looking bad. However, not being able to come up with a clear explanation for what is happening has me a bit bugged (mostly because of point 1 in your comment).

oesteban commented 5 years ago

Please note that the errors happen within _get_inputs(), which means that the failing cache is not the results file of the current node, but that of some prior node. We will need to improve the cache-checking logic so that _get_inputs() is not called this early, first checking whether the hashfile and the result file are present. Right now, the cache check goes as follows:

  1. Does the current node's folder exist?
  2. Run _get_inputs(), which propagates results from prior nodes connected to the inputs.
  3. Check the hashfile (this requires the actual inputs to calculate the hash).

A useful optimization that I'm sure would minimize some of these issues is a new step between 1 and 2 that checks whether the result file and a hashfile-looking file exist (sketched below). However, that optimization should happen after we clear up this weirdness, because on local filesystems _get_inputs() should not fail.
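
A minimal sketch of what that extra step could look like (the helper name _early_cache_hit is made up for illustration, and the file-naming patterns follow my reading of the engine code, not a final design):

    # Hypothetical sketch of the proposed pre-check, to run between steps 1 and 2
    # above; _early_cache_hit is an illustrative name, not existing nipype API.
    from pathlib import Path

    def _early_cache_hit(node):
        """Cheap check for a result file and a hashfile before calling _get_inputs()."""
        outdir = Path(node.output_dir())                # step 1: the node's folder
        if not outdir.exists():
            return False
        result_file = outdir / ("result_%s.pklz" % node.name)
        hashfiles = list(outdir.glob("_0x*.json"))      # hashfile-looking files
        # Only if both are present is it worth running _get_inputs() and
        # recomputing the hash (steps 2 and 3 of the current logic).
        return result_file.exists() and len(hashfiles) == 1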

effigies commented 5 years ago

Is there a path forward for testing this, or is it mostly a "wait and see if this keeps cropping up after the next release" sort of issue?

oesteban commented 5 years ago

I'm afraid we are in a "wait and see if this keeps cropping up after the next release" sort of situation. There might be a path forward, based on the fact that a bunch of usual suspects keep falling for this, so it doesn't seem to be a general problem affecting all MapNodes.

However, this is harder to work around than #3009, because for MapNodes we really need to propagate the inputs from prior interfaces at this point in order to know how many subnodes are spawned (see the toy example below).
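
For illustration (a toy example, not taken from the failing workflow): the number of subnodes a MapNode spawns is simply the length of its iterfield input, which is only known once upstream outputs have been propagated.

    # Toy example: a MapNode cannot know how many subnodes to spawn until the
    # list feeding its iterfield has arrived from the upstream node.
    from nipype import MapNode
    from nipype.interfaces.utility import Function

    def _identity(in_value):
        return in_value

    toy = MapNode(
        Function(input_names=["in_value"], output_names=["out"], function=_identity),
        iterfield=["in_value"],
        name="toy_mapnode",
    )
    toy.inputs.in_value = [1, 2, 3]  # three elements -> three subnodes (_toy_mapnode0..2)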

effigies commented 5 years ago

Okay. Just going to move it out of 1.2.3 then.

stilley2 commented 5 years ago

Hi all, I'm seeing a problem that I think is related to this issue. If I use the MultiProc plugin and commit 01de656ff74e2804f8b1a0494ae69e8fc66766ce, I see errors like the following:

     Error while checking node hash, forcing re-run. Although this error may not prevent the workflow from running, it could indicate a major problem. Please report a new issue at https://github.com/nipy/nipype/issues adding the following information:

    Node: report.page_sub-XXXX.reportlets
    Interface: nipype.interfaces.utility.base.Merge
    Traceback:
Traceback (most recent call last):

  File "/home/stilley2/lib/nipype/nipype/pipeline/plugins/base.py", line 337, in _local_hash_check
    cached, updated = self.procs[jobid].is_cached()

  File "/home/stilley2/lib/nipype/nipype/pipeline/engine/nodes.py", line 297, in is_cached
    hashed_inputs, hashvalue = self._get_hashval()

  File "/home/stilley2/lib/nipype/nipype/pipeline/engine/nodes.py", line 494, in _get_hashval
    self._get_inputs()

  File "/home/stilley2/lib/nipype/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
    outputs = _load_resultfile(results_file).outputs

  File "/home/stilley2/lib/nipype/nipype/pipeline/engine/utils.py", line 301, in load_resultfile
    result = loadpkl(results_file)

  File "/home/stilley2/lib/nipype/nipype/utils/filemanip.py", line 689, in loadpkl
    with pklopen(infile.name, 'rb') as pkl_file:

  File "/usr/lib64/python3.7/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)

  File "/usr/lib64/python3.7/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')

FileNotFoundError: [Errno 2] No such file or directory: 'result_compare1.pklz'

This happens pretty consistently. Also, occasionally the pipeline will freeze. If I update to the next commit (686aef9), things seem to work; however, I'm not sure if that is just hiding the problem. I am doing everything on a local filesystem. If you all think this is unrelated, I can open a new issue.

Thanks!

oesteban commented 5 years ago

Hi @stilley2, looking at your traceback, this totally looks like #3009. This one has been left open after #3026 was merged because it only affects MapNodes, as @mattcieslak reported in https://github.com/nipy/nipype/issues/3009#issuecomment-527547733.

However, this WARNING should not be fatal: your processing should have continued, and if it failed, that should have happened for some other reason. You are right that that commit basically prevents this WARNING from showing up in most cases, by checking the work directory more carefully when it looks like the node is cached.

Could you try to isolate the condition that causes the pipeline to freeze? How did you check that it was blocked?

stilley2 commented 5 years ago

I just did some digging. The pipeline hangs while acquiring the lockfile in filemanip.loadpkl. I assume I don't see the error in later commits because loadpkl isn't called as frequently, or something. I think adding a timeout to the lock is a good idea though (see the sketch below). Out of curiosity, how is it that we end up trying to load results files before they're ready when checking the node hash, but don't run into this problem when we actually run the nodes?
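
Something like this is what I have in mind for the timeout (just a sketch, assuming loadpkl keeps using filelock's SoftFileLock and the usual .lock sidecar name; 60 s is arbitrary):

    # Sketch only: fail after 60 s instead of blocking forever on a stale lock.
    from pathlib import Path
    from filelock import SoftFileLock, Timeout

    infile = Path("result_compare1.pklz")      # the result file from my traceback
    lockfile = str(infile) + ".lock"           # assumed sidecar lock name
    try:
        with SoftFileLock(lockfile, timeout=60):
            ...  # loadpkl would read the pickled result here
    except Timeout:
        raise IOError("Could not acquire %s within 60 s" % lockfile)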

satra commented 5 years ago

@stilley2 - thanks for this. i am pretty positive i introduced an unwanted race condition via the locking mechanism. i'll take a look later today.

stilley2 commented 5 years ago

I think I found what might be the cause of a lot of these issues. In loadpkl, the working directory is changed via the indirectory context manager: https://github.com/nipy/nipype/blob/56a033548b833dc5fbe831562f966580a3f677c5/nipype/utils/filemanip.py#L710 . However, between that line and any actual file loading (or, presumably, lock activity), the directory can be changed again by the _async_callback function in multiproc.py: https://github.com/nipy/nipype/blob/56a033548b833dc5fbe831562f966580a3f677c5/nipype/pipeline/plugins/multiproc.py#L148. This can lead to the warning about results files not being found, and might be responsible for the lockfile issues as well.

stilley2 commented 5 years ago

If anyone wants to prove this to themselves, just add the following lines right after entering the indirectory context manager:

    import os
    from time import sleep

    sleep(10)  # give another worker's callback time to chdir elsewhere
    if os.getcwd() != str(infile.parent):  # infile.parent is assumed absolute
        raise RuntimeError('directory switched')

and run a pipeline with the MultiProc plugin.
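
One possible mitigation (just my sketch, not something that exists in nipype) would be to make the load independent of the process-wide working directory by resolving the path up front instead of relying on indirectory:

    # Sketch of a cwd-independent load: resolving the absolute path once means a
    # concurrent os.chdir() from _async_callback can no longer break the open().
    import gzip
    import pickle
    from pathlib import Path

    def loadpkl_abs(infile):
        infile = Path(infile).resolve()
        opener = gzip.open if infile.suffix == ".pklz" else open
        with opener(str(infile), "rb") as pkl_file:
            return pickle.load(pkl_file)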

oesteban commented 5 years ago

To keep close: https://neurostars.org/t/fmriprep-1-5-0-errors/5122/5

oesteban commented 4 years ago

Another one bites the dust - https://neurostars.org/t/error-populating-the-input-of-a-node-in-fmriprep-the-results-file-of-the-source-node-does-not-exist/5203/5

oesteban commented 4 years ago

Interestingly, it seems MultiProc checks MapNodes twice:

191101-16:56:56,106 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm0" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm0".
191101-16:56:56,105 nipype.utils INFO:
         Running nipype version 1.3.0-rc1.post-dev+g2e5436d31 (latest: 1.2.3)
191101-16:56:56,108 nipype.workflow INFO:
         [Node] Running "_t1w_conform_xfm0" ("niworkflows.interfaces.freesurfer.PatchedLTAConvert"), a CommandLine Interface with command:
lta_convert --inlta identity.nofile --outlta /home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm0/out.lta --src /mnt/data/datasets/THP002/sub-THP0005/ses-THP0005JHU1/anat/sub-THP0005_ses-THP0005JHU1_run-02_T1w.nii.gz --trg /home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform/mapflow/_t1w_conform0/sub-THP0005_ses-THP0005JHU1_run-02_T1w_ras.nii.gz
191101-16:56:56,159 nipype.utils INFO:
         Running nipype version 1.3.0-rc1.post-dev+g2e5436d31 (latest: 1.2.3)
191101-16:56:56,236 nipype.workflow INFO:
         [Node] Finished "_t1w_conform_xfm0".
191101-16:56:56,266 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm1" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1".
191101-16:56:56,269 nipype.workflow INFO:
         [Node] Running "_t1w_conform_xfm1" ("niworkflows.interfaces.freesurfer.PatchedLTAConvert"), a CommandLine Interface with command:
lta_convert --inlta identity.nofile --outlta /home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1/out.lta --src /mnt/data/datasets/THP002/sub-THP0005/ses-THP0005MGH1/anat/sub-THP0005_ses-THP0005MGH1_T1w.nii.gz --trg /home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform/mapflow/_t1w_conform1/sub-THP0005_ses-THP0005MGH1_T1w_ras.nii.gz
191101-16:56:56,409 nipype.workflow INFO:
         [Node] Finished "_t1w_conform_xfm1".
191101-16:56:57,337 nipype.workflow INFO:
         [Node] Setting-up "dmriprep_wf.single_subject_THP0005_wf.anat_preproc_wf.anat_template_wf.t1w_conform_xfm" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm".
191101-16:56:57,340 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm0" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm0".
191101-16:56:57,341 nipype.workflow INFO:
         [Node] Cached "_t1w_conform_xfm0" - collecting precomputed outputs
191101-16:56:57,341 nipype.workflow INFO:
         [Node] "_t1w_conform_xfm0" found cached.
191101-16:56:57,342 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm1" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1".
191101-16:56:57,342 nipype.workflow INFO:
         [Node] Cached "_t1w_conform_xfm1" - collecting precomputed outputs
191101-16:56:57,343 nipype.workflow INFO:
         [Node] "_t1w_conform_xfm1" found cached.
191101-16:56:57,344 nipype.workflow INFO:
         [Node] Finished "dmriprep_wf.single_subject_THP0005_wf.anat_preproc_wf.anat_template_wf.t1w_conform_xfm".

You'll first see one of these:

191101-16:56:56,266 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm1" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1".

and about a second later:

191101-16:56:57,342 nipype.workflow INFO:
         [Node] Setting-up "_t1w_conform_xfm1" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1".

This is happening while running sMRIPrep from dMRIPrep, on a local installation of nipype on the branch of #3075.

This is on a fresh work directory. If I reuse the old workdir, then MultiProc preempts running it again (it is checked only once).

satra commented 4 years ago

mapnodes have always done that for any distributed plugin. they first insert all the subnodes, then insert the parent node, which depends on these subnodes. when the parent node re-executes, it checks all the subnodes once more, but at this point they shouldn't run because they have been cached.
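
roughly, the expansion looks like this (a simplified sketch, not the actual plugin code):

    # Simplified sketch (not real nipype code): distributed plugins schedule the
    # subnodes first and then the parent MapNode, which depends on all of them.
    def expand_mapnode(mapnode, graph):
        subnodes = list(mapnode._make_nodes())   # _somenode0, _somenode1, ...
        for sub in subnodes:
            graph.add_node(sub)                  # graph is, e.g., a networkx.DiGraph
        graph.add_node(mapnode)
        for sub in subnodes:
            graph.add_edge(sub, mapnode)         # parent waits for every subnode
        # When the parent finally runs, it sets up each subnode again; they are
        # already cached by then, so they are only collected, not re-executed.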

oesteban commented 4 years ago

Okay, that's reassuring in the sense that we did not introduce this during a refactor, but my hope of having found a potential lead is now gone :P

satra commented 4 years ago

@oesteban - do you know if #3089 resolved this?