Closed oesteban closed 4 years ago
Cleaning up the cache solved the problem.
@satra - I believe this (and #3009) are an artifact of working directory degradation, but I'm worried that they could have some other source, since they happen while _get_inputs() is run. Gathering inputs from previous nodes should work, since their result files have already been checked (and should exist, as in the case of #3009). In this case of MapNodes, I think that the outputs of the prior node got aggregated for some obscure reason.
Before I go ahead and start a refactor that could end up being a rat's nest, I wanted to touch base with you. Do you want to allocate some time to chat (cc @effigies)?
let's chat. i would like to understand if this is something that is a new bug that was created by a commit, or is because of mixing versions of caches.
I wonder if this is the same problem as #2694. IIRC, that was also impossible to reproduce because a clean working directory didn't exhibit the problem, and there was no clear path toward intentionally producing a broken one.
I believe that last month's refactor should have addressed that, precisely.
My impression is that nipype is being overly noisy: the first time the cache is checked (early in that check there is a call to _get_inputs()), in the master thread, the result file of some preceding node is not yet found. In the second pass of the cache check (after the node has already been submitted to the executor, so this check is done, e.g., by the child process or a new cluster node), _get_inputs() actually finds the result file.
This synchronization issue worries me a little, as I would expect the result file of preceding nodes to be available already in the first call to _get_inputs().
@oesteban - this really depends on the file system and the executor.
should never be an issue using multiproc/concurrentfuture, since the filesystem is local and any filesystem cache does not have to be written out to a server. if the issue is happening with local executors, it is likely a different reason.
this is often the case in many job schedulers with NFS clients/servers, which is why there is a timeout configuration in nipype to explicitly deal with delayed dirty cache writes across compute nodes. a user can change this config option for such scenarios.
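For reference, that timeout can be raised through nipype's configuration. A minimal sketch, assuming the relevant option is `job_finished_timeout` under the `execution` section (verify the name against your nipype version's configuration docs):

```python
# Sketch: relax nipype's result-file timeout for NFS-backed clusters,
# so delayed dirty-cache writes from other compute nodes have time to land.
# (Assumes the option is `job_finished_timeout` under `execution`.)
from nipype import config

config.update_config({'execution': {'job_finished_timeout': 65}})
```

The same setting can also live in a `[execution]` section of the user's nipype config file.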
Right, that's why I believe this is mostly noise and things aren't looking bad. However, not being able to come up with a clear explanation for what is happening has me a bit bugged (mostly because of point 1 in your comment).
Please note that the errors happen within _get_inputs(), which means that the failing cache is not the result file of the current node, but of some prior node. We will need to improve the cache-checking logic so that _get_inputs() is not called this early, and first check whether the hashfile and the result file are present. Right now, the cache check starts by calling _get_inputs(), which propagates results from prior nodes connected to the inputs. A useful optimization that I'm sure would minimize some of these issues is a new step between 1-2 to check whether the result file exists and a hashfile-looking file exists. However, that optimization should happen after we clear up this weirdness, because on local filesystems _get_inputs() should not fail.
Is there a path forward for testing this, or is it mostly a "wait and see if this keeps cropping up after the next release" sort of issue?
I'm afraid we are in a "wait and see if this keeps cropping up after the next release" situation. There might be a path forward, based on the fact that a handful of usual suspects keep falling for this, so it doesn't seem to be a general problem of all MapNodes.
However, this is harder to work around than #3009, because for MapNodes we really need to propagate the inputs from prior interfaces at this point in order to know how many subnodes are spawned.
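To illustrate why: a MapNode fans out one subnode per element of its iterated input, so the upstream input lists must already be resolved (hence the early _get_inputs()) before the node can even be expanded. A toy sketch (hypothetical helper, not nipype's actual expansion logic):

```python
def expand_mapnode(name, iterfield_values):
    """One subnode per element of the iterated input; without the
    upstream values we cannot know how many subnodes to create.
    (Toy sketch, not nipype code.)"""
    return [f"_{name}{i}" for i in range(len(iterfield_values))]
```

For two input T1w files this produces the `_t1w_conform_xfm0` / `_t1w_conform_xfm1` subnode names visible in the logs later in this thread.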
Okay. Just going to move it out of 1.2.3 then.
Hi all, I'm seeing a problem that I think is related to this issue. If I use the multiproc plugin at commit 01de656ff74e2804f8b1a0494ae69e8fc66766ce, I get the following error:
Error while checking node hash, forcing re-run. Although this error may not prevent the workflow from running, it could indicate a major problem. Please report a new issue at https://github.com/nipy/nipype/issues adding the following information:
Node: report.page_sub-XXXX.reportlets
Interface: nipype.interfaces.utility.base.Merge
Traceback:
Traceback (most recent call last):
File "/home/stilley2/lib/nipype/nipype/pipeline/plugins/base.py", line 337, in _local_hash_check
cached, updated = self.procs[jobid].is_cached()
File "/home/stilley2/lib/nipype/nipype/pipeline/engine/nodes.py", line 297, in is_cached
hashed_inputs, hashvalue = self._get_hashval()
File "/home/stilley2/lib/nipype/nipype/pipeline/engine/nodes.py", line 494, in _get_hashval
self._get_inputs()
File "/home/stilley2/lib/nipype/nipype/pipeline/engine/nodes.py", line 521, in _get_inputs
outputs = _load_resultfile(results_file).outputs
File "/home/stilley2/lib/nipype/nipype/pipeline/engine/utils.py", line 301, in load_resultfile
result = loadpkl(results_file)
File "/home/stilley2/lib/nipype/nipype/utils/filemanip.py", line 689, in loadpkl
with pklopen(infile.name, 'rb') as pkl_file:
File "/usr/lib64/python3.7/gzip.py", line 53, in open
binary_file = GzipFile(filename, gz_mode, compresslevel)
File "/usr/lib64/python3.7/gzip.py", line 163, in __init__
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'result_compare1.pklz'
This happens pretty consistently. Also, occasionally the pipeline will freeze. If I update to the next commit (686aef9), things seem to work. However, I'm not sure if that is just hiding the problem. I am doing everything on a local file system. If you all think this is unrelated, I can open a new issue.
Thanks!
Hi @stilley2, looking at your traceback, this totally looks like #3009. This one has been left open after #3026 was merged because it only affects MapNodes, as @mattcieslak reported in https://github.com/nipy/nipype/issues/3009#issuecomment-527547733.
However, this WARNING should not be fatal - your processing should have continued, and if it failed, that should have happened for some other reason. You are right that that commit basically prevents this WARNING from showing up in most cases, by more carefully checking the work directory when it looks like the node is cached.
Could you try to isolate the condition that causes the pipeline to freeze? How did you determine that it was blocked?
I just did some digging. The pipeline hangs at getting the lockfile in filemanip.loadpkl. I assume that I don't see the error in later commits because loadpkl isn't called as frequently or something. I think adding a timeout to the lock is a good idea though. Out of curiosity, how is it that we're trying to load results files before they're ready to check the node hash, but not running into this problem when we actually run the nodes?
@stilley2 - thanks for this. i am pretty positive i introduced an unwanted race condition via the locking mechanism. i'll take a look later today.
I think I found what might be the cause of a lot of these issues. In loadpkl, the directory is changed via the indirectory context manager (https://github.com/nipy/nipype/blob/56a033548b833dc5fbe831562f966580a3f677c5/nipype/utils/filemanip.py#L710). However, between this line and any actual file loading (or, presumably, lock activity), the directory can be changed by the _async_callback function in multiproc.py (https://github.com/nipy/nipype/blob/56a033548b833dc5fbe831562f966580a3f677c5/nipype/pipeline/plugins/multiproc.py#L148). This can lead to the warning about results files not being found, and might be responsible for the lockfile issues as well.
If anyone wants to prove this to themselves, just add the following lines right after the indirectory context manager
import os
from time import sleep

sleep(10)  # give another thread's callback time to chdir away
if os.getcwd() != str(infile.parent):  # infile is a pathlib.Path
    raise RuntimeError('directory switched')
and run a pipeline with the MultiProc plugin.
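The underlying hazard is that os.chdir() mutates process-global state, so a chdir-based context manager can never be thread-safe. A self-contained demonstration, independent of nipype:

```python
import os
import tempfile
import threading

dir_a = tempfile.mkdtemp()
dir_b = tempfile.mkdtemp()
barrier = threading.Barrier(2)
observed = {}

def worker(name, target):
    os.chdir(target)   # each thread "sets" its own cwd...
    barrier.wait()     # ...then waits until the other thread has too
    observed[name] = os.getcwd()

threads = [
    threading.Thread(target=worker, args=("a", dir_a)),
    threading.Thread(target=worker, args=("b", dir_b)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both threads end up in the SAME directory: whichever chdir ran last won,
# so at least one thread is no longer where it believes it is.
assert observed["a"] == observed["b"]
```

The robust fix is to stop relying on the working directory altogether and open result files by absolute path.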
To keep track of before closing: https://neurostars.org/t/fmriprep-1-5-0-errors/5122/5
Interestingly, it seems MultiProc checks on mapnodes twice:
191101-16:56:56,106 nipype.workflow INFO:
[Node] Setting-up "_t1w_conform_xfm0" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm0".
191101-16:56:56,105 nipype.utils INFO:
Running nipype version 1.3.0-rc1.post-dev+g2e5436d31 (latest: 1.2.3)
191101-16:56:56,108 nipype.workflow INFO:
[Node] Running "_t1w_conform_xfm0" ("niworkflows.interfaces.freesurfer.PatchedLTAConvert"), a CommandLine Interface with command:
lta_convert --inlta identity.nofile --outlta /home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm0/out.lta --src /mnt/data/datasets/THP002/sub-THP0005/ses-THP0005JHU1/anat/sub-THP0005_ses-THP0005JHU1_run-02_T1w.nii.gz --trg /home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform/mapflow/_t1w_conform0/sub-THP0005_ses-THP0005JHU1_run-02_T1w_ras.nii.gz
191101-16:56:56,159 nipype.utils INFO:
Running nipype version 1.3.0-rc1.post-dev+g2e5436d31 (latest: 1.2.3)
191101-16:56:56,236 nipype.workflow INFO:
[Node] Finished "_t1w_conform_xfm0".
191101-16:56:56,266 nipype.workflow INFO:
[Node] Setting-up "_t1w_conform_xfm1" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1".
191101-16:56:56,269 nipype.workflow INFO:
[Node] Running "_t1w_conform_xfm1" ("niworkflows.interfaces.freesurfer.PatchedLTAConvert"), a CommandLine Interface with command:
lta_convert --inlta identity.nofile --outlta /home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1/out.lta --src /mnt/data/datasets/THP002/sub-THP0005/ses-THP0005MGH1/anat/sub-THP0005_ses-THP0005MGH1_T1w.nii.gz --trg /home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform/mapflow/_t1w_conform1/sub-THP0005_ses-THP0005MGH1_T1w_ras.nii.gz
191101-16:56:56,409 nipype.workflow INFO:
[Node] Finished "_t1w_conform_xfm1".
191101-16:56:57,337 nipype.workflow INFO:
[Node] Setting-up "dmriprep_wf.single_subject_THP0005_wf.anat_preproc_wf.anat_template_wf.t1w_conform_xfm" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm".
191101-16:56:57,340 nipype.workflow INFO:
[Node] Setting-up "_t1w_conform_xfm0" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm0".
191101-16:56:57,341 nipype.workflow INFO:
[Node] Cached "_t1w_conform_xfm0" - collecting precomputed outputs
191101-16:56:57,341 nipype.workflow INFO:
[Node] "_t1w_conform_xfm0" found cached.
191101-16:56:57,342 nipype.workflow INFO:
[Node] Setting-up "_t1w_conform_xfm1" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1".
191101-16:56:57,342 nipype.workflow INFO:
[Node] Cached "_t1w_conform_xfm1" - collecting precomputed outputs
191101-16:56:57,343 nipype.workflow INFO:
[Node] "_t1w_conform_xfm1" found cached.
191101-16:56:57,344 nipype.workflow INFO:
[Node] Finished "dmriprep_wf.single_subject_THP0005_wf.anat_preproc_wf.anat_template_wf.t1w_conform_xfm".
You'll see first one of those:
191101-16:56:56,266 nipype.workflow INFO:
[Node] Setting-up "_t1w_conform_xfm1" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1".
and about a second later:
191101-16:56:57,342 nipype.workflow INFO:
[Node] Setting-up "_t1w_conform_xfm1" in "/home/oesteban/tmp/dmriprep-THP002/work/dmriprep_wf/single_subject_THP0005_wf/anat_preproc_wf/anat_template_wf/t1w_conform_xfm/mapflow/_t1w_conform_xfm1".
This is happening running sMRIPrep from dMRIPrep on a local installation of nipype on the branch of #3075.
This is on a fresh work directory. If I reuse the old workdir, then MultiProc skips running it again (it is set up only once).
mapnodes have always done that for any distributed plugin. they first insert all the subnodes, then insert the parent node as a dependency of these subnodes. when the parent node re-executes, it checks all the subnodes once more, but at this point they shouldn't run because they have been cached.
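The two-pass pattern satra describes can be illustrated with a toy cache (purely illustrative, not nipype code):

```python
# Toy model of a distributed plugin expanding a MapNode: subnodes run
# first; the parent then aggregates and re-checks every subnode, finding
# each one cached, which is why each subnode is "Setting-up" twice.
cache = {}
log = []

def run_subnode(name):
    """Run a subnode unless a cached result already exists."""
    if name in cache:
        log.append(f"{name} found cached")
    else:
        log.append(f"running {name}")
        cache[name] = f"result of {name}"
    return cache[name]

subnodes = ["_xfm0", "_xfm1"]

# pass 1: the plugin submits each subnode directly
for name in subnodes:
    run_subnode(name)

# pass 2: the parent MapNode aggregates, revisiting every subnode
outputs = [run_subnode(name) for name in subnodes]
```

This matches the log above: each subnode executes once, then shows up again a second later as "found cached" when the parent node collects its outputs.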
Okay, that's reassuring in the sense that we did not introduce this during a refactor - but my expectation about having found a potential lead is now gone :P
@oesteban - do you know if #3089 resolved this?
Summary
https://circleci.com/gh/oesteban/smriprep/502
Related to #3009 (in that case, it occurred for a regular Node).