nipreps / smriprep

Structural MRI PREProcessing (sMRIPrep) workflows for NIPreps (NeuroImaging PREProcessing tools)
https://nipreps.github.io/smriprep
Apache License 2.0

autorecon3 issue #129

Open fliem opened 5 years ago

fliem commented 5 years ago

Running fmriprep (v1.4.0) gives me random I/O-related autorecon3 errors in around half the subjects, similar to this.

Random, in that it is not always the same subjects across different executions, and not always the same error. If I run without T2w, the T2.prenorm.mgz-related errors are replaced by others (for instance, during mri_segstats: "No such file or directory; ERROR: loading mri/wmparc.mgz", even though the file exists).

I am running one subject per instance with 64 GB of memory, with Docker given access to nearly all of it.

I thought this might be because freesurfer_dir is shared over NFS, so I tried two approaches, neither of which changed anything: i) mounting a local, unshared directory; ii) using a non-mounted directory inside the Docker container. Hence, https://github.com/poldracklab/smriprep/issues/44 won't fix this.

However, FreeSurfer works fine when run via bids/freesurfer:v6.0.1-5, and also when fmriprep is run with 1 CPU, which leads me to believe this might be an fmriprep parallelization issue.
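
For reference, a minimal sketch of what I mean by running with 1 CPU (a Python wrapper around the CLI; the --nthreads flag and the paths are assumptions to be checked against `fmriprep --help` for your version):

```python
# Minimal sketch of the single-CPU workaround; paths and participant label
# are placeholders, and --nthreads is assumed to be the relevant flag.
import subprocess

cmd = [
    "fmriprep",
    "/data/bids", "/data/out", "participant",
    "--participant-label", "01",
    "--nthreads", "1",  # serializes the workflow; the sporadic errors go away
]
subprocess.run(cmd, check=True)
```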

Since autorecon3 is potentially run for lh and rh simultaneously, might the problem be that both processes try to write non-hemisphere-specific files at the same time (e.g., during -T2pial or -wmparc)?

smriprep says that

The excluded steps in the second and third stages (-no<option>) are not fully hemisphere independent, and are therefore postponed to the final two stages.

but if autorecon3 is run in parallel for lh and rh, won't the same issues occur there?
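
To make the hypothesis concrete, here is a minimal sketch (not sMRIPrep's actual code) of two per-hemisphere autorecon3 runs launched concurrently; the subject ID, SUBJECTS_DIR, and the exact flag set are placeholders/assumptions:

```python
# Sketch of the hypothesized situation: two recon-all -autorecon3 invocations,
# one per hemisphere, running at the same time for the same subject.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SUBJECTS_DIR = "/data/freesurfer"   # hypothetical path
SUBJECT = "sub-01"                  # hypothetical subject

def autorecon3(hemi):
    """Run stage 3 for a single hemisphere."""
    cmd = [
        "recon-all",
        "-autorecon3",
        "-hemi", hemi,
        "-s", SUBJECT,
        "-sd", SUBJECTS_DIR,
    ]
    return subprocess.run(cmd, capture_output=True, text=True)

# Both hemispheres run concurrently. Steps that are not hemisphere-specific
# (e.g. whatever touches mri/wmparc.mgz or the T2-based pial refinement)
# would then be reached by both processes, possibly at different times,
# which is the suspected source of the sporadic I/O errors.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(autorecon3, ["lh", "rh"]))
```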

oesteban commented 5 years ago

Thanks for reporting this - we are extremely interested in getting to the bottom of this problem. And yes, this is an open issue that we haven't been able to replicate in-house.

@effigies, does Franz's hypothesis about parallelization sound plausible to you?

effigies commented 5 years ago

If running a single job at a time works, then it's not a parallelism problem but a concurrency problem. That is, the data dependencies are correct, but there may be a race condition where one hemisphere ends up modifying a file just as the other tries to read it. This seems strange, since recon-all -parallel does basically the same thing we do. The main difference is that, instead of letting each hemisphere progress as it can, recon-all -parallel runs each parallelizable step as two concurrent processes, one per hemisphere.

So it may be that the hemispheres fall out of lockstep: it's not a race condition where both are running -segstats at the same time, but rather one is running -segstats while the other is running a step that modifies wmparc.mgz.
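
A toy illustration of the lockstep point (nothing to do with FreeSurfer internals; the step names are made up): with a barrier, both workers are always inside the same step, roughly like recon-all -parallel; remove the barrier and one hemisphere can already be in a step that rewrites a shared file while the other is still reading it.

```python
# Toy illustration of "lockstep" vs. free-running hemisphere workers.
import threading

STEPS = ["segstats", "update_wmparc", "more_stats"]   # illustrative names only
barrier = threading.Barrier(2)   # comment this out -> the workers drift apart

def hemisphere(hemi):
    for step in STEPS:
        barrier.wait()            # both workers enter the same step together
        print(f"{hemi}: {step}")  # a real step might read or rewrite wmparc.mgz here

threads = [threading.Thread(target=hemisphere, args=(h,)) for h in ("lh", "rh")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```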

One question: does this repeat when resuming? Our recon-all jobs will not try to re-run portions that have already completed, so race conditions should not reproduce consistently, since the timing window should be relatively narrow.

fliem commented 5 years ago

I can confirm that resuming does not lead to this problem.