statisticalbiotechnology / quandenser-pipeline

A nextflow/singularity pipeline for quandenser
Apache License 2.0

MSconvert crash, caused by failed singularity mount (loop devices) #22

Closed by TimothyOlsson 5 years ago

TimothyOlsson commented 5 years ago

Since MSconvert uses wine and runs on "loop devices" in Linux, it can sometimes fail. The error seems to occur when you run more forks than the computer can handle: the crashes occurred when running 8 forks on a computer with 6 cores. I am currently testing whether this is the cause of the error.

Note: https://github.com/statisticalbiotechnology/quandenser-pipeline/commit/7145d05a1e6db361daf18fbde89e33f2fae0deb8 is in this case a double-edged sword. Processes that crash because Singularity failed to initiate are restarted and finish, but they can "freeze" because of the limit on the number of cores. A frozen process has to be killed manually, which means the pipeline stops completely. This is bad for a number of reasons.

Error message:

Caused by:
  Process `msconvert (8)` terminated with an error exit status (255)

Command executed:

  mkdir -p converted
  wine msconvert 20161018_QEp2_PhGe_SA_LC12-14_Bariatric_Plate1_a12_F9_post.raw --filter "peakPicking true 1-" -o converted  | tee -a stdout.txt

Command exit status:
  255

Command output:
  (empty)

Command error:
  FATAL:   container creation failed: mount /proc/self/fd/7->/usr/local/var/singularity/mnt/session/rootfs error: can't mount image /proc/self/fd/7: failed to find loop device: could not attach image file to loop device: failed to set loop flags on loop device: resource temporarily unavailable

Work dir:
  /home/tib/paper_runs/PXD009348_Boxcar_Plasma/work/70/318cdc7b54b5204bd7b9656c3fddad

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

TimothyOlsson commented 5 years ago

This crash does not seem to be specific to msconvert; it can affect any process running in Singularity. However, other processes do not "freeze" like msconvert does.

Edit: This also does not seem to be related to running more forks than cores, since it can still happen when there are fewer forks than cores. The Singularity version is 3.4.0-111.gd27885659 and the process is running on a new computer. I have never experienced this issue with version 3.0.3-582.ge13861e9.

See https://github.com/sylabs/singularity/issues/2540 and others. This seems to be a common issue with Singularity, and it could have many causes.

TimothyOlsson commented 5 years ago

"Kind of" fixed in commit https://github.com/statisticalbiotechnology/quandenser-pipeline/commit/46ee2647a81d019c1665df1db0555ac52363b795, since the error strategy will now try to remount the image if mounting fails. If a process fails to mount twice in a row it will still crash, but this happens less frequently. This should fix the issue for the majority of users, who will probably never encounter it again.
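For readers who want to see the retry idea in isolation, here is a minimal Python sketch. It is not the code from the commit above; in Nextflow this kind of behaviour is normally configured through the `errorStrategy` and `maxRetries` directives. The names `is_mount_failure` and `run_with_retry` are hypothetical, and the matched substring is taken from the error message quoted earlier in this issue.

```python
import subprocess
import sys

# Substring copied from the Singularity error quoted in this issue.
MOUNT_ERROR = "failed to find loop device"


def is_mount_failure(stderr_text):
    """Return True if stderr looks like the loop-device mount failure."""
    return MOUNT_ERROR in stderr_text


def run_with_retry(cmd, max_retries=1):
    """Run cmd; rerun it up to max_retries times on a mount failure.

    Returns (CompletedProcess, number_of_attempts). Failures that are not
    mount failures are returned immediately without retrying.
    """
    attempts = 0
    while True:
        result = subprocess.run(cmd, capture_output=True, text=True)
        attempts += 1
        if result.returncode == 0:
            return result, attempts
        if not is_mount_failure(result.stderr) or attempts > max_retries:
            return result, attempts


if __name__ == "__main__":
    # Stand-in command; a real call would wrap the containerised step.
    result, attempts = run_with_retry([sys.executable, "-c", "print('ok')"])
    print(result.returncode, attempts)
```

With `max_retries=1`, a command that hits the mount error twice in a row still fails, which matches the remaining crash case described above.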

TimothyOlsson commented 4 years ago

This issue is "kind of fixed", but the root cause is still unknown. Perhaps it is the problem described in the links below, which has been fixed in a later Singularity version?

https://github.com/sylabs/singularity/issues/4048 and https://github.com/sylabs/singularity/pull/4069

In that case, I will need to change the shell script to install the latest Singularity version, which in turn will interfere with the running jobs tab (see issue #21).

EDIT: Testing briefly on 3.4.1 (the latest Singularity version), the running jobs tab works and I can once again see host processes. Singularity v3.4.0 had the issue, which has since been fixed, namely:

"Fixes an issue where a PID namespace was always being used"

If that is the case, the shell script could check for Singularity updates and reinstall Singularity when needed.
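A version check of that kind could look roughly like the sketch below. It is only an illustration in Python, not part of the pipeline's shell script. The function names are hypothetical, and it assumes the version string contains a `major.minor.patch` triple like the ones quoted in this thread (for example `3.4.0-111.gd27885659`). The 3.4.1 threshold comes from the brief test described above.

```python
import re

# Version reported above as working in a brief test.
FIXED_VERSION = (3, 4, 1)


def parse_version(text):
    """Extract the first major.minor.patch triple from a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def older_than_fixed(version_text, minimum=FIXED_VERSION):
    """Return True if the installed version is older than `minimum`."""
    return parse_version(version_text) < minimum


if __name__ == "__main__":
    for candidate in ("3.4.0-111.gd27885659", "3.4.1"):
        print(candidate, older_than_fixed(candidate))
```

Comparing tuples of integers avoids the pitfalls of comparing version strings lexically. The suffix after the patch number (such as `-111.gd27885659`) is ignored in this sketch.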