psadil closed this issue 7 years ago.
How large are your functional runs? We've run into issues with the signals node when the uncompressed dataset is very large. In master, we now have a --low-mem option to instruct FMRIPREP to use more disk space to reduce memory usage.
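For example, a minimal sketch of what that invocation could look like (the image tag and bind-mount paths are placeholders; substitute the ones from your own docker command):

docker run -ti --rm -v /path/to/bids:/data:ro -v /path/to/derivatives:/out -v /path/to/work:/scratch poldracklab/fmriprep:&lt;tag&gt; /data /out participant --participant_label 01 -w /scratch --low-mem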
The easiest way to confirm which node is hanging is to use --nthreads 1 (same as -n_cpus 1). Then the hanging node will be the most recently run one. You can also try logging into the running Docker container and:
apt-get update && apt-get install htop
htop
This will help you see the actual processes, which you can sort by memory and CPU usage, and may be a bit more informative than those stats.
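If it helps, a minimal sketch of getting a shell inside the running container in the first place (the container ID is a placeholder, taken from the docker ps output):

docker ps
docker exec -it &lt;container_id&gt; bash

From there, the apt-get and htop commands above can be run as usual.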
Oh, sorry for not paying better attention to the other issue threads. I'll try with master, monitoring with htop along the way.
The functional runs are 218 TRs of 113 x 153 x 149 images (2mm^3 resolution).
edit: that is, I'll try with master and the --low-mem flag.
No worries! It's a lot to ask of users to get deep into the weeds of the development process.
That's a fairly modest image (4GB if it's float64, 2GB for float32), so I doubt that's the issue. Hopefully htop will reveal more about what's going on. You can also add the --debug flag, which should make the logs more verbose.
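For reference, a rough back-of-the-envelope estimate for one of those runs, assuming a single in-memory copy of the data: 113 × 153 × 149 × 218 ≈ 5.6 × 10^8 voxels, i.e. about 4.5 GB at 8 bytes per voxel (float64) or about 2.2 GB at 4 bytes per voxel (float32).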
I cancelled the image that had stalled, started master with --low-mem, and the analysis mostly finished. Didn't see this last message until later so I left --debug off. The report in sub-01.html says that there are no errors to report, but the confounds .tsv is missing and there are 4 crash files in the fmriprep/sub-01/log folder:
crash-20170817-203736-root-merge-110362df-89f9-4de5-bfcf-3468a628f745.txt
crash-20170817-203650-root-ds_bold_mni-63f55cc9-2ecd-4ca3-8602-9514199fadc9.txt
crash-20170817-203650-root-merge-0387471d-3798-48b5-84a7-6b390a368973.txt
crash-20170817-203713-root-merge-101fa912-2ea6-4864-b193-b060296ac3cf.txt
Could these crashes just be because I was working with the same out/ and work/ directories as the image that I halted?
I suppose that's possible. Given that it's an Input/output error in all cases, it was presumably something between the Linux VFS layer and your host filesystem, but at a guess, that could range from out-of-space, to a timeout (Docker for Windows uses CIFS, which is a network filesystem, to mount), to a checksum failure.
I suspect if you remove the crashfiles and rerun, it'll work properly, as there should be less concurrent use of the filesystem.
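As a sketch of what that could look like from PowerShell (the derivatives path is a placeholder; rm is the PowerShell alias for Remove-Item):

rm C:\path\to\derivatives\fmriprep\sub-01\log\crash-*.txt

followed by re-running the same docker command as before.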
Okay. I reran the analysis with just one functional run after deleting the output and working folders. The sub-01.html again reported no errors, the confounds.tsv file looked okay, but there were no inflated surfaces and seven new crash logs appeared:
crash-20170818-221940-root-sampler.aI.a1-db54e678-ec77-48cd-ae5d-47cbca1e4b83.txt
crash-20170818-221810-root-_rename_src0-2b399789-72cf-41b2-880e-2a196d2b9e45.txt
crash-20170818-221846-root-_rename_src1-e5034938-bce0-42e1-8fa9-38bdccd16e49.txt
crash-20170818-221938-root-rename_src-b7f1f274-1a89-4dea-af6c-f790faaf09b3.txt
crash-20170818-221940-root-sampler.aI.a0-34d41ee9-2c4b-47ec-a7e1-5091eed4ec67.txt
crash-20170818-221940-root-sampler.aI.a0-d874371f-15d7-479c-ac78-615ae1d1f425.txt
crash-20170818-221940-root-sampler.aI.a1-5a7ee8ab-6f28-42e2-995a-779e4daa819e.txt
docker run -ti --rm -v C:\Users\admin\Desktop\vtf:/data:ro -v C:\Users\admin\Desktop\vtf\derivatives:/out -v C:\Users\admin\Desktop\vtf\derivatives\work:/scratch fmriprep:master /data /out participant --output-space T1w template fsnative fsaverage -w /scratch --participant_label 01 --low-mem
Sorry for the delay. These look like more IO errors. All I can suggest is to clear them out and re-run.
Well, that worked. I ran with a clean working directory and empty derivatives, and this time it ran without errors. I look forward to whatever release includes the low-mem option.
Thanks!
@psadil Yesterday's release includes --low-mem.
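If you are switching from a locally built master image to the released one, something like the following should work (the tag is an assumption; use whichever version the release notes name):

docker pull poldracklab/fmriprep:latest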
Cool, thanks again for the help and the software.
Hi @psadil and @effigies, I'm running into a very similar problem ... twice. The first time was on a low-powered machine that was running out of HDD space for the scratch. I'm now running on a server with 2TB of scratch and 32GB of mem. In general, fmriprep has been running well for single subjects on the high-spec'd box. However, when processing the full dataset (~25 subjects), it got through many of the steps, but is now hanging for about 3 days on one node:
The log for this subject suggests that in a previous attempt (within the same call to fmriprep) to run this subject, the process crashed because of space:
Node: fmriprep_wf.single_subject_18042401_wf.func_preproc_ses_01_task_resting_run_02_wf.bold_bold_trans_wf.bold_reference_wf.validate
Working directory: /scratch/fmriprep_wf/single_subject_18042401_wf/func_preproc_ses_01_task_resting_run_02_wf/bold_bold_trans_wf/bold_reference_wf/validate
Node inputs:
ignore_exception = False
in_file = /scratch/fmriprep_wf/single_subject_18042401_wf/func_preproc_ses_01_task_resting_run_02_wf/bold_bold_trans_wf/merge/vol0000_xform-00000_merged.nii.gz
Traceback (most recent call last):
File "/usr/local/miniconda/lib/python3.6/site-packages/niworkflows/nipype/pipeline/plugins/multiproc.py", line 68, in run_node
result['result'] = node.run(updatehash=updatehash)
File "/usr/local/miniconda/lib/python3.6/site-packages/niworkflows/nipype/pipeline/engine/nodes.py", line 480, in run
result = self._run_interface(execute=True)
File "/usr/local/miniconda/lib/python3.6/site-packages/niworkflows/nipype/pipeline/engine/nodes.py", line 564, in _run_interface
return self._run_command(execute)
File "/usr/local/miniconda/lib/python3.6/site-packages/niworkflows/nipype/pipeline/engine/nodes.py", line 644, in _run_command
result = self._interface.run(cwd=outdir)
File "/usr/local/miniconda/lib/python3.6/site-packages/niworkflows/nipype/interfaces/base/core.py", line 520, in run
runtime = self._run_interface(runtime)
File "/usr/local/miniconda/lib/python3.6/site-packages/fmriprep/interfaces/images.py", line 468, in _run_interface
img.to_filename(out_fname)
File "/usr/local/miniconda/lib/python3.6/site-packages/nibabel/filebasedimages.py", line 334, in to_filename
self.to_file_map()
File "/usr/local/miniconda/lib/python3.6/site-packages/nibabel/analyze.py", line 1096, in to_file_map
arr_writer.to_fileobj(imgf)
File "/usr/local/miniconda/lib/python3.6/site-packages/nibabel/arraywriters.py", line 562, in to_fileobj
nan2zero=self._needs_nan2zero())
File "/usr/local/miniconda/lib/python3.6/site-packages/nibabel/volumeutils.py", line 766, in array_to_file
nan_fill=nan_fill if nan2zero else None)
File "/usr/local/miniconda/lib/python3.6/site-packages/nibabel/volumeutils.py", line 833, in _write_data
fileobj.write(dslice.tostring())
File "/usr/local/miniconda/lib/python3.6/site-packages/nibabel/openers.py", line 205, in write
return self.fobj.write(*args, **kwargs)
File "/usr/local/miniconda/lib/python3.6/gzip.py", line 264, in write
self.fileobj.write(self.compress.compress(data))
OSError: [Errno 28] No space left on device
However, my disk usage on the share with the output and scratch is far below the max (using 1.2 of 2TB). Memory usage is also low (4 of 32GB). CPU has been pegged at 100% for 1 core. (There are 12 cores, most of which were in use during most of the fmriprep job, until the current hang). I've also checked the docker container, which does not have a cap on its disk image. So it's hard to see how fmriprep is running out of space. (Note that this crash report was prior to the current ongoing hang, so it could be the hang is unrelated to space).
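In case it helps with debugging, a sketch of how one might compare free space and inodes as seen from the host versus from inside the container (the scratch path and container ID are placeholders):

df -h /path/to/scratch
docker exec &lt;container_id&gt; df -h /scratch
docker exec &lt;container_id&gt; df -i /scratch

"No space left on device" (ENOSPC) can also be raised when inodes are exhausted even though df -h shows plenty of free bytes, which is why the df -i check is worth including.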
The job took a long time to get this far, so I'm loath to stop it if there's a) some way to rescue it, or b) some debugging info I should get before terminating it.
What do you all recommend?
Thanks very much, Daniel
Hi @danielkimmel, can you open a new issue? It's hard to track conversations in closed issues.
Hi Chris, Sure. Just did: https://github.com/poldracklab/fmriprep/issues/1231
I'm not sure whether I'm just being impatient or whether the signals node is hanging. I'm running fmriprep-docker in Windows 10, with 8 cores available and ~12GB RAM.
docker run --rm -it -v C:\Users\psadil\Desktop\vtf:/data:ro -v C:\Users\psadil\Desktop\vtf\derivatives:/out -v C:\Users\psadil\Desktop\vtf\derivatives\work:/scratch poldracklab/fmriprep:1.0.0-rc2 /data /out participant --output-space T1w template fsnative fsaverage -w /scratch --participant_label 01
There is just a single participant, one anatomical file, and two functional runs. There is slice-timing information, no fieldmap, and the bold files have a limited field-of-view (primarily visual cortex). Much of the process finished at some point last night (it took about 6 hrs for the surface files to appear in the output directory). But, based on the timestamp for the bold_confounds_wf/signals/ directory in the working directory (of both functional runs), fmriprep has been working on that node for ~10 hrs. No crash reports have been generated in the logs subdirectory in the out/fmriprep folder. But here are a few extra details given by docker:
docker stats --no-stream
CPU %: 0.05
MEM %: 1.52
NET I/O: 15.2kb / 2.96kb
BLOCK I/O: 8.92 GB / 382 MB
PIDS: 12

Thank you for your time!
PS: It was while trying to figure this out by changing n_cpus that I initially encountered #668.
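As an aside, a couple of other host-side checks that might help tell whether the signals node is actually doing anything (the container ID is a placeholder, from docker ps):

docker top &lt;container_id&gt;
docker logs --tail 50 &lt;container_id&gt;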