statisticalbiotechnology / quandenser-pipeline

A nextflow/singularity pipeline for quandenser
Apache License 2.0

MSconvert hangs, preventing pipeline progression #23

Closed TimothyBergstrom closed 4 years ago

TimothyBergstrom commented 5 years ago

I have tested msconvert on multiple computers (a Linux desktop, a VirtualBox image and a cluster) and it has worked fine in all cases. However, on a new computer running Ubuntu 18.04, I have encountered a very strange and annoying error that prevents the pipeline from progressing.

Sometimes, when running msconvert in parallel, it hangs on "converting spectra", seemingly at random. In most cases it happens when one process crashes, as in issue https://github.com/statisticalbiotechnology/quandenser-pipeline/issues/22. The process that crashes is usually not the one that hangs, and it is not the same files every time.

When the process hangs, msconvert keeps running in the background using very little CPU (maybe 10% of a core, at most), and wine is also running in the background. This freezes the pipeline: the process neither finishes nor returns a non-zero exit status, so manual intervention is required to stop the whole pipeline (like issue https://github.com/statisticalbiotechnology/quandenser-pipeline/issues/3). After checking the error logs, I noticed that the process that hangs usually lacks the usual "wine warnings" that accompany the msconvert process (Note: stderr to stdout in msconvert is disabled, since it "clogs up" the stdout file). This suggests that the problem could be related to wine.

So far, running msconvert in "non-parallel mode" seems to fix this problem, but this sacrifices the significant speed-up from the pipeline's parallel msconvert. I have found very little about this problem on the internet, and so far it is isolated to the specific computer I am testing on. It is still an important problem, since other users could encounter this error as well.

The problem persists when changing the Singularity version, so it does not seem related to Singularity.

Correct:

filenames:
  15_pT2+_LFQ.raw
processing file: 15_pT2+_LFQ.raw 
calculating source file checksums
writing output file: converted\15_pT2+_LFQ.mzML
converting spectra: 60440/60440
converting chromatograms: 1/1

Hang:

filenames:
  15_pT2+_LFQ.raw
processing file: 15_pT2+_LFQ.raw 
calculating source file checksums
writing output file: converted\15_pT2+_LFQ.mzML
converting spectra: 8924/60440

TimothyBergstrom commented 4 years ago

Idea how to fix it:

Both msconvert and quandenser_parallel_1 suffer from seemingly random hangs in which they never return a non-zero exit status. Nextflow does not include a way to monitor stdout and fail the process when it hangs. The only way it could be done in Nextflow, AFAIK, would be to set a timer, but that is a bad fit because processing time varies a lot, for example with large RAW files in msconvert or boxcar scans in quandenser_parallel_1.

However, if we wrap the process inside another process, e.g. a Python script that starts the process and monitors its stdout for known errors, such as issue #3 and issue #23, it could possibly work. If the "monitor" script detects the known errors, it could either:

The process should also exit quickly if it hangs, to minimize wasted time, but not so quickly that a process is terminated when it is just slow (like on a slow computer). Since msconvert does not yield an error message, this could be particularly difficult. A timer that fires when nothing has changed in stdout for a while could be a way to do it.
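
A minimal sketch of the "nothing has changed in stdout" timer described above. This is not the content of command_wrapper.py; the function name `run_with_idle_timeout` and its parameters `idle_timeout` and `poll_interval` are hypothetical and chosen only for illustration. It reads raw bytes instead of lines, on the assumption that a progress counter such as "converting spectra: 8924/60440" may be redrawn without a newline.

```python
import subprocess
import sys
import threading
import time


def run_with_idle_timeout(cmd, idle_timeout, poll_interval=0.1):
    """Run cmd and kill it if its stdout stays silent for idle_timeout seconds.

    Returns (exit_code, timed_out, captured_stdout_bytes).
    All names here are illustrative, not taken from the pipeline.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    last_activity = [time.monotonic()]  # mutable cell shared with the reader thread
    chunks = []

    def pump():
        # Read whatever bytes are available; do not wait for a newline.
        while True:
            data = proc.stdout.read1(4096)
            if not data:  # EOF: the child closed its stdout
                break
            chunks.append(data)
            last_activity[0] = time.monotonic()

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    timed_out = False
    while proc.poll() is None:
        if time.monotonic() - last_activity[0] > idle_timeout:
            # No output for too long: treat the process as hung.
            timed_out = True
            proc.kill()
            break
        time.sleep(poll_interval)

    proc.wait()
    # Give the reader a moment to drain remaining output; it is a daemon
    # thread, so a child process still holding the pipe cannot block exit.
    reader.join(timeout=1)
    return proc.returncode, timed_out, b"".join(chunks)
```

A wrapper script could call this and exit with a non-zero status when `timed_out` is true, so that Nextflow sees a failure (and could retry it, for example with `errorStrategy 'retry'`). Note that `proc.kill()` only terminates the direct child; helper processes such as wine may need separate cleanup, and the right `idle_timeout` depends on how long a slow machine can legitimately stay silent.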

TimothyBergstrom commented 4 years ago

Fixed with the file command_wrapper.py in commit https://github.com/statisticalbiotechnology/quandenser-pipeline/commit/46ee2647a81d019c1665df1db0555ac52363b795