nipreps / fmriprep

fMRIPrep is a robust and easy-to-use pipeline for preprocessing of diverse fMRI data. The transparent workflow dispenses with manual intervention, thereby ensuring the reproducibility of the results.
https://fmriprep.org
Apache License 2.0

Apparent hanging during bold_confound_wf #671

Closed: psadil closed this issue 7 years ago

psadil commented 7 years ago

I'm not sure whether I'm just being impatient or whether the signals node is hanging. I'm running fmriprep-docker on Windows 10 with 8 cores available and ~12GB of RAM.

docker run --rm -it -v C:\Users\psadil\Desktop\vtf:/data:ro -v C:\Users\psadil\Desktop\vtf\derivatives:/out -v C:\Users\psadil\Desktop\vtf\derivatives\work:/scratch poldracklab/fmriprep:1.0.0-rc2 /data /out participant --output-space T1w template fsnative fsaverage -w /scratch --participant_label 01

There is just a single participant, one anatomical file, and two functional runs. There is slice-timing information, no fieldmap, and the BOLD files have a limited field of view (primarily visual cortex). Much of the process finished at some point last night (it took ~6 hours for the surface files to appear in the output directory). But based on the timestamp of the bold_confounds_wf/signals/ directory in the working directory (for both functional runs), fmriprep has been working on that node for ~10 hours.

No crash reports have been generated in the logs subdirectory of the out/fmriprep folder, but here are a few extra details reported by docker:

docker stats --no-stream
CPU %: 0.05   MEM %: 1.52   NET I/O: 15.2kb / 2.96kb   BLOCK I/O: 8.92 GB / 382 MB   PIDS: 12

Thank you for your time!

PS It was while trying to figure this out by changing n_cpus that I initially encountered #668.

effigies commented 7 years ago

How large are your functional runs? We've run into issues with the signals node when the uncompressed dataset is very large. In master, we now have a --low-mem option to instruct FMRIPREP to use more disk space to reduce memory usage.

The easiest way to confirm which node is hanging is to use --nthreads 1 (same as --n_cpus 1); the hanging node will then be the most recently run one. You can also try logging into the running Docker container and running:

apt-get update && apt-get install htop
htop

This will let you see the actual processes, sorted by memory or CPU usage, and may be a bit more informative than those stats.
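
For reference, a minimal sketch of those steps from the host side (the container ID below is a placeholder; docker ps shows the real one):

docker ps                                  # note the ID of the running fmriprep container
docker exec -it <container-id> bash        # open a shell inside it
apt-get update && apt-get install -y htop
htop                                       # press M to sort by memory, P to sort by CPU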

psadil commented 7 years ago

Oh, sorry for not paying better attention to the other issue threads. I'll try with master, monitoring with htop along the way.

The functional runs are 218 TRs of 113 x 153 x 149 images (2mm^3 resolution).

edit: that is, I'll try with master and the --low-mem flag

effigies commented 7 years ago

No worries! It's a lot to ask of users to get deep into the weeds of the development process.

That's a fairly modest image (4GB if it's float64, 2GB for float32), so I doubt that's the issue. Hopefully htop will reveal more about what's going on. You can also add the --debug flag, which should make the logs more verbose.
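
As a quick back-of-the-envelope check of those numbers (a sketch; assumes python3 is available on the host):

python3 -c "print(113 * 153 * 149 * 218 * 8 / 1e9)"   # ~4.5 GB for one run as float64
python3 -c "print(113 * 153 * 149 * 218 * 4 / 1e9)"   # ~2.2 GB as float32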

psadil commented 7 years ago

I cancelled the image that had stalled, started master with --low-mem, and the analysis mostly finished. I didn't see this last message until later, so I left --debug off. The report in sub-01.html says there are no errors to report, but the confounds .tsv is missing and there are 4 crash files in the fmriprep/sub-01/log folder:

crash-20170817-203736-root-merge-110362df-89f9-4de5-bfcf-3468a628f745.txt
crash-20170817-203650-root-ds_bold_mni-63f55cc9-2ecd-4ca3-8602-9514199fadc9.txt
crash-20170817-203650-root-merge-0387471d-3798-48b5-84a7-6b390a368973.txt
crash-20170817-203713-root-merge-101fa912-2ea6-4864-b193-b060296ac3cf.txt

Could these crashes just be because I was working with the same out/ and work/ directories as the image that I halted?

effigies commented 7 years ago

I suppose that's possible. Given that it's an Input/output error in all cases, it was presumably something between the Linux VFS layer and your host filesystem. At a guess, that could range from running out of space, to a timeout (Docker for Windows uses CIFS, a network filesystem, for its mounts), to a checksum failure.

I suspect if you remove the crashfiles and rerun, it'll work properly, as there should be less concurrent use of the filesystem.
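
For example, from the host (a sketch; the derivatives path is assumed to match the -v mounts in the earlier docker run command):

del C:\Users\psadil\Desktop\vtf\derivatives\fmriprep\sub-01\log\crash-*.txt

Then re-run the same docker run command, keeping -w /scratch pointed at the existing working directory so that already-completed nodes can be reused.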

psadil commented 7 years ago

Okay. I reran the analysis with just one functional run after deleting the output and working folders. The sub-01.html again reported no errors and the confounds.tsv file looked okay, but there were no inflated surfaces and seven new crash logs appeared:

crash-20170818-221940-root-sampler.aI.a1-db54e678-ec77-48cd-ae5d-47cbca1e4b83.txt
crash-20170818-221810-root-_rename_src0-2b399789-72cf-41b2-880e-2a196d2b9e45.txt
crash-20170818-221846-root-_rename_src1-e5034938-bce0-42e1-8fa9-38bdccd16e49.txt
crash-20170818-221938-root-rename_src-b7f1f274-1a89-4dea-af6c-f790faaf09b3.txt
crash-20170818-221940-root-sampler.aI.a0-34d41ee9-2c4b-47ec-a7e1-5091eed4ec67.txt
crash-20170818-221940-root-sampler.aI.a0-d874371f-15d7-479c-ac78-615ae1d1f425.txt
crash-20170818-221940-root-sampler.aI.a1-5a7ee8ab-6f28-42e2-995a-779e4daa819e.txt

docker run -ti --rm -v C:\Users\admin\Desktop\vtf:/data:ro -v C:\Users\admin\Desktop\vtf\derivatives:/out -v C:\Users\admin\Desktop\vtf\derivatives\work:/scratch fmriprep:master /data /out participant --output-space T1w template fsnative fsaverage -w /scratch --participant_label 01 --low-mem

effigies commented 7 years ago

Sorry for the delay. These look like more IO errors. All I can suggest is to clear them out and re-run.

psadil commented 7 years ago

Well, that worked. I ran with a clean working directory and empty derivatives folder, and this time it finished without errors. I look forward to whatever release includes the --low-mem option.

Thanks!

effigies commented 7 years ago

@psadil Yesterday's release includes --low-mem.

psadil commented 7 years ago

Cool, thanks again for the help and the software.

danielkimmel commented 6 years ago

Hi @psadil and @effigies, I'm running into a very similar problem ... twice. The first time was on a low-powered machine that was running out of HDD space for the scratch. I'm now running on a server with 2TB of scratch and 32GB of memory. In general, fmriprep has been running well for single subjects on this higher-spec box. However, when processing the full dataset (~25 subjects), it got through many of the steps but has now been hanging for about 3 days on one node (screenshot attached).

The log for this subject suggests that, in a previous attempt to run this subject (within the same call to fmriprep), the process crashed because it ran out of space:

Node: fmriprep_wf.single_subject_18042401_wf.func_preproc_ses_01_task_resting_run_02_wf.bold_bold_trans_wf.bold_reference_wf.validate
Working directory: /scratch/fmriprep_wf/single_subject_18042401_wf/func_preproc_ses_01_task_resting_run_02_wf/bold_bold_trans_wf/bold_reference_wf/validate

Node inputs:

ignore_exception = False
in_file = /scratch/fmriprep_wf/single_subject_18042401_wf/func_preproc_ses_01_task_resting_run_02_wf/bold_bold_trans_wf/merge/vol0000_xform-00000_merged.nii.gz

Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.6/site-packages/niworkflows/nipype/pipeline/plugins/multiproc.py", line 68, in run_node
    result['result'] = node.run(updatehash=updatehash)
  File "/usr/local/miniconda/lib/python3.6/site-packages/niworkflows/nipype/pipeline/engine/nodes.py", line 480, in run
    result = self._run_interface(execute=True)
  File "/usr/local/miniconda/lib/python3.6/site-packages/niworkflows/nipype/pipeline/engine/nodes.py", line 564, in _run_interface
    return self._run_command(execute)
  File "/usr/local/miniconda/lib/python3.6/site-packages/niworkflows/nipype/pipeline/engine/nodes.py", line 644, in _run_command
    result = self._interface.run(cwd=outdir)
  File "/usr/local/miniconda/lib/python3.6/site-packages/niworkflows/nipype/interfaces/base/core.py", line 520, in run
    runtime = self._run_interface(runtime)
  File "/usr/local/miniconda/lib/python3.6/site-packages/fmriprep/interfaces/images.py", line 468, in _run_interface
    img.to_filename(out_fname)
  File "/usr/local/miniconda/lib/python3.6/site-packages/nibabel/filebasedimages.py", line 334, in to_filename
    self.to_file_map()
  File "/usr/local/miniconda/lib/python3.6/site-packages/nibabel/analyze.py", line 1096, in to_file_map
    arr_writer.to_fileobj(imgf)
  File "/usr/local/miniconda/lib/python3.6/site-packages/nibabel/arraywriters.py", line 562, in to_fileobj
    nan2zero=self._needs_nan2zero())
  File "/usr/local/miniconda/lib/python3.6/site-packages/nibabel/volumeutils.py", line 766, in array_to_file
    nan_fill=nan_fill if nan2zero else None)
  File "/usr/local/miniconda/lib/python3.6/site-packages/nibabel/volumeutils.py", line 833, in _write_data
    fileobj.write(dslice.tostring())
  File "/usr/local/miniconda/lib/python3.6/site-packages/nibabel/openers.py", line 205, in write
    return self.fobj.write(*args, **kwargs)
  File "/usr/local/miniconda/lib/python3.6/gzip.py", line 264, in write
    self.fileobj.write(self.compress.compress(data))
OSError: [Errno 28] No space left on device

However, my disk usage on the share holding the output and scratch is far below the maximum (1.2TB of 2TB used). Memory usage is also low (4GB of 32GB). CPU has been pegged at 100% for 1 core. (There are 12 cores, most of which were in use during most of the fmriprep job, until the current hang.) I've also checked the docker container, which does not have a cap on its disk image. So it's hard to see how fmriprep is running out of space. (Note that this crash report was generated prior to the current ongoing hang, so the hang may be unrelated to space.)
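
For reference, one way to compare the space the container sees with what the host reports (a sketch; the container ID is a placeholder, and docker system df requires a reasonably recent Docker):

docker ps                                        # find the running fmriprep container
docker exec <container-id> df -h /scratch /out   # free space as seen from inside the container
docker system df                                 # Docker's own image/container/volume usage on the host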

The job took a long time to get this far, so I'm loath to stop it if there's (a) some way to rescue it, or (b) some debugging info I should collect before terminating it.

What do you all recommend?

Thanks very much, Daniel

effigies commented 6 years ago

Hi @danielkimmel, can you open a new issue? It's hard to track conversations in closed issues.

danielkimmel commented 6 years ago

Hi Chris, Sure. Just did: https://github.com/poldracklab/fmriprep/issues/1231