Using fmriprep 20.1.1 (setup.cfg shows `sdcflows ~= 1.3.1`) in Singularity on a cluster, the qwarp step of the pepolar pipeline has now been running for more than 50 hours on multiple jobs/runs of the pipeline on data from the same dataset. Is this expected to take that long? If not, what can I do to diagnose the problem?
Thanks!
No, that is not expected. Is the 3dQwarp tool actually running on the node, or did you just check the output log?
3dQwarp was actually running on the node (the job timed out), high CPU usage and all. I will check what is left in the workdir, the input files, etc.
- Ran the same 3dQwarp command outside of Singularity with the NeuroDebian AFNI on my laptop: completes quickly.
- Ran the same command inside the Singularity image on my laptop: completes quickly.
- Ran the same command inside the Singularity image on the cluster: completes quickly, but this was without OpenMP (as set in the Dockerfile).
- Ran the same command inside the Singularity image on the cluster with `OMP_NUM_THREADS=8`: got the following error, but it still completes faster, as expected.

    skipping - powell_newuoa_con() failure code=-1
    + powell_newuoa_con( ndim=16 x=0x2921f90 xbot=0x29229e0 xtop=0x2922a70 nrand=0 rstart=0.444000 rend=0.003996 maxcall=159 ufunc=0x49c3a0
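For reference, a minimal Python sketch of re-running the same command with a pinned thread count. The arguments and file names below are placeholders; the real invocation can be copied from the qwarp node's `command.txt` in the nipype working directory.

```python
import os
import subprocess
import time

# Placeholder invocation: copy the real arguments from the qwarp node's
# command.txt in the nipype working directory.
cmd = [
    "3dQwarp",
    "-plusminus", "-noweight",
    "-base", "epi_reversed.nii.gz",    # placeholder inputs
    "-source", "epi_forward.nii.gz",
    "-prefix", "qwarp_test",
]

env = dict(os.environ, OMP_NUM_THREADS="8")  # pin the OpenMP team size
start = time.perf_counter()
subprocess.run(cmd, env=env, check=True)
print(f"elapsed: {time.perf_counter() - start:.1f}s")
```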
I see two possibilities. One involves `--resource-monitor`: I use it to evaluate the resources needed on a subset of the dataset, to then heuristically request them more accurately from SLURM. I have no idea how this works in nipype, but I can imagine that the processes might have some inter-dependencies, because one is monitoring the other.
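For context, nipype's resource monitor is a config switch that periodically samples each node's process tree for CPU/RSS usage, which is what makes the monitored process and the monitor inter-dependent. A minimal sketch of how it is enabled (fmriprep's `--resource-monitor` flag appears to toggle exactly this):

```python
from nipype import config

# Equivalent of what fmriprep's --resource-monitor appears to toggle:
config.enable_resource_monitor()

# ...which is the same as setting the config options directly:
config.set("monitoring", "enabled", "true")
config.set("monitoring", "sample_frequency", "1")  # seconds between samples
```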
I will resubmit the pipeline without `--resource-monitor` to see if it completes.

Launched it without `--resource-monitor`: it is slow but progressing (strace on the node shows regular progress output from 3dQwarp on stderr). It could be slow because multiple 3dQwarp processes with `OMP_NUM_THREADS=8` are running concurrently under `n_cpus=8`, since the subject has multiple BOLD runs.
It is even slower than running with `OMP_NUM_THREADS=1`. At the current pace, it might take 50+ hours.
In parallel, I launched a single 3dQwarp node through nipype (load the `.pklz`, run) in Singularity. The processing speed is normal and it completes.
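A minimal sketch of that load-and-run step, assuming nipype's usual working-directory layout (the path below is a placeholder for the actual qwarp node directory):

```python
from nipype.utils.filemanip import loadpkl

# Placeholder path: point this at the cached node inside the nipype workdir.
PKLZ = "work/.../qwarp/_node.pklz"

node = loadpkl(PKLZ)
node.base_dir = "/tmp/qwarp_rerun"  # re-run in a scratch location
result = node.run()
print(result.outputs)
```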
I reran the pipeline with `--n_cpus=16` and `--omp-nthreads=4`, and it completed in a reasonable amount of time.
I am not familiar with OpenMP programming, but it seems that when concurrent processes each use the maximum number of CPU threads, a lot of time might be spent in context switching.
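A back-of-the-envelope illustration of that hypothesis, with assumed numbers from this thread (the concurrent-process count is a guess):

```python
# Rough arithmetic behind the context-switching hypothesis. The number of
# concurrently scheduled 3dQwarp processes is an assumption (one per BOLD
# run that MultiProc happens to schedule together).
physical_cores = 8        # SLURM allocation (n_cpus=8)
concurrent_qwarps = 4     # assumed
omp_threads_each = 8      # OMP_NUM_THREADS=8

total_threads = concurrent_qwarps * omp_threads_each
print(f"{total_threads} runnable threads on {physical_cores} cores -> "
      f"{total_threads / physical_cores:.0f}x oversubscription")
```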
Is the problem that we're just not marking 3dQwarp as a multithreaded node?
It seems that `n_procs` is specified: https://github.com/nipreps/sdcflows/blob/c3ffa59418303854f8ca9d1ff1c2a7f22978e643/sdcflows/workflows/pepolar.py#L123-L126
Yeah, you're right. Sorry, not able to attend closely rn.
Should we not allow 3dQwarp to access more than 4 CPUs? It's not clean at all, but it would prevent this from happening, it seems.
So `qwarp_nprocs = min(omp_nthreads, 4)`? I don't see that as inelegant, if 3dQwarp can't usefully take more.
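A sketch of where that cap could slot in, loosely modeled on the linked pepolar.py node definition (interface arguments omitted; `omp_nthreads` is given an example value here, normally it comes from the workflow arguments):

```python
from nipype.pipeline import engine as pe
from nipype.interfaces import afni

omp_nthreads = 8  # example value; the real one comes from the workflow

# Proposed cap: don't hand 3dQwarp more threads than it can usefully exploit.
qwarp_nprocs = min(omp_nthreads, 4)

qwarp = pe.Node(
    afni.QwarpPlusMinus(environ={"OMP_NUM_THREADS": str(qwarp_nprocs)}),
    name="qwarp",
    n_procs=qwarp_nprocs,  # lets the MultiProc plugin budget this node's CPUs
)
```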