torognes / vsearch

Versatile open-source tool for microbiome analysis

fastq_mergepairs randomly fails to finish when called in parallel #469

Closed fjossandon closed 2 years ago

fjossandon commented 2 years ago

Hello, I frequently use several Vsearch functions, and along the way I've noticed that the --fastq_mergepairs option sometimes just hangs: the process doesn't finish even after several hours, while the CPU sits idle. I have not seen this happen with other options like --uchime_denovo or --usearch_global. Killing the process and launching the command again usually works fine, so it's hard to reproduce. It's also worth mentioning that the command is usually executed in multiprocess mode.

I tested Vsearch using the latest version:

fossandon@ubuntu:~/Documents/$ python3 -V
Python 3.6.9

fossandon@ubuntu:~/Documents/$ vsearch -v
vsearch v2.19.0_linux_x86_64, 15.6GB RAM, 4 cores
https://github.com/torognes/vsearch

Rognes T, Flouri T, Nichols B, Quince C, Mahe F (2016)
VSEARCH: a versatile open source tool for metagenomics
PeerJ 4:e2584 doi: 10.7717/peerj.2584 https://doi.org/10.7717/peerj.2584

Compiled with support for gzip-compressed files, and the library is loaded.
zlib version 1.2.11, compile flags a9
Compiled with support for bzip2-compressed files, and the library is loaded.

I recently had some time to search for a test case and put a small script to reproduce it, so I put it in Drive because the files are a little big: https://drive.google.com/drive/folders/1GEfmai3TarXrSBchLET68VamZ_IbaA46?usp=sharing

The compressed file contains a test case for which I originally ran the fastq_mergepairs command with fastqout first to get the merged reads, and then again with fastqout_notmerged_fwd instead to get another file containing only the forward sequences of the read pairs that could not be merged. The second (unmerged) command hung in this case:

vsearch --fastq_mergepairs input_119148_R1.fastq --reverse input_119148_R2.fastq --fastq_allowmergestagger --fastq_minlen 100 --fastq_minovlen 100 --fastq_minmergelen 100 --fastq_maxns 20 --fastq_maxdiffs 50 --fastq_eeout --fastqout_notmerged_fwd 119148_unmerged.fastq
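Since killing and relaunching usually unblocks a hung run (as noted above), one practical workaround while the bug exists is to wrap the call with a timeout and retry. A minimal sketch in Python (the timeout and retry count are arbitrary choices, not from the original script; the command is the one above):

```python
import subprocess

CMD = [
    "vsearch", "--fastq_mergepairs", "input_119148_R1.fastq",
    "--reverse", "input_119148_R2.fastq",
    "--fastq_allowmergestagger",
    "--fastq_minlen", "100", "--fastq_minovlen", "100",
    "--fastq_minmergelen", "100", "--fastq_maxns", "20",
    "--fastq_maxdiffs", "50", "--fastq_eeout",
    "--fastqout_notmerged_fwd", "119148_unmerged.fastq",
]

def run_with_retry(cmd, timeout_s=600, retries=3):
    """Run cmd; if it exceeds timeout_s, kill it and relaunch."""
    for attempt in range(retries):
        try:
            return subprocess.run(cmd, timeout=timeout_s, check=True)
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt + 1} hung, retrying")
    raise RuntimeError("command hung on every attempt")
```

`subprocess.run(..., timeout=...)` kills the child process on expiry, which matches the manual kill-and-relaunch behaviour described above.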

The compressed file contains the following files:

I want to note that although the test script reuses around 10 different output files with the same input during execution to save space, in the original case all the parallel processes were handling different inputs and outputs.

I executed the sequential script a couple of times and Vsearch did not hang; a normal Vsearch execution only takes around 1 second per iteration. On the other hand, while executing the parallel script I found that 2 processes hung out of the 1000 iterations. I was using a limit of 3 active processes, so the script was able to reach the last iteration with the 3rd one, but could not reach the last print because of the 2 that did not close. The output looks like this:

launching up to 3 parallel processes
2 started
3 started
1 started
Now will wait for all processes to finish
2  finished!
4 started
3  finished!
...................................
999 started
999  finished!
1000 started
1000  finished!

The script cannot print the last "Finished!" message and stays like that unless killed.
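For reference, the parallel driver can be sketched roughly like this (a reconstruction from the description above, not the exact script from Drive; the pool size of 3 and the ~10 reused output names follow the text):

```python
import subprocess
from multiprocessing import Pool

def build_cmd(i):
    # Reuse ~10 output file names across the 1000 iterations to save space.
    out = f"test_{i % 10}_unmerged.fastq"
    return ["vsearch", "--fastq_mergepairs", "input_119148_R1.fastq",
            "--reverse", "input_119148_R2.fastq",
            "--fastq_allowmergestagger",
            "--fastq_minlen", "100", "--fastq_minovlen", "100",
            "--fastq_minmergelen", "100", "--fastq_maxns", "20",
            "--fastq_maxdiffs", "50", "--fastq_eeout",
            "--fastqout_notmerged_fwd", out]

def run_one(i):
    print(i + 1, "started", flush=True)
    subprocess.run(build_cmd(i), check=True)
    print(i + 1, " finished!", flush=True)

if __name__ == "__main__":
    print("launching up to 3 parallel processes")
    with Pool(3) as pool:
        pool.map(run_one, range(1000))
    print("Finished!")
```

With this structure, a hang in any single vsearch invocation blocks `pool.map`, which is why the driver above never prints its final "Finished!" line when two children fail to exit.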

In this last iteration, if I execute ps ax I can see that there are 2 vsearch processes still alive:

fossandon@ubuntu:~/Documents/$ ps ax | grep "vsearch"
 60273 pts/11   Sl+    0:17 python3 vsearch_test_parallel.py
 60275 pts/11   S+     0:01 python3 vsearch_test_parallel.py
 60276 pts/11   S+     0:00 python3 vsearch_test_parallel.py
 60277 pts/11   S+     0:00 python3 vsearch_test_parallel.py
 60540 pts/11   Sl+    0:02 vsearch --fastq_mergepairs input_119148_R1.fastq --reverse input_119148_R2.fastq --fastq_allowmergestagger --fastq_minlen 100 --fastq_minovlen 100 --fastq_minmergelen 100 --fastq_maxns 20 --fastq_maxdiffs 50 --fastq_eeout --fastqout_notmerged_fwd test_0_unmerged.fastq
 60565 pts/11   Sl+    0:02 vsearch --fastq_mergepairs input_119148_R1.fastq --reverse input_119148_R2.fastq --fastq_allowmergestagger --fastq_minlen 100 --fastq_minovlen 100 --fastq_minmergelen 100 --fastq_maxns 20 --fastq_maxdiffs 50 --fastq_eeout --fastqout_notmerged_fwd test_5_unmerged.fastq
 68322 pts/12   S+     0:00 grep --color=auto vsearch

In summary, this apparently happens only when multiprocessing, and randomly, so several repetitions are needed to trigger it.

I have no clue why it happens, but it seems specific to fastq_mergepairs. I hope you can reproduce it too with the scripts.

Best regards,

torognes commented 2 years ago

Thanks for reporting this issue and the very detailed description. I'll look into the issue soon.

frederic-mahe commented 2 years ago

Very nice bug report @fjossandon ! This is probably similar to the elusive issue #439

I did my best in March 2021 to reproduce and analyze what I believe to be a multithreading issue, but without any success. I used rr with its option rr record --chaos. According to its documentation, rr is good at finding multithreading bugs by artificially starving one thread.

torognes commented 2 years ago

I have been able to reproduce the problem by running vsearch 2.20.0 repeatedly with your input files, at least under the following circumstances:

Usually the hang occurred before 100 runs were done. It seems to happen more often with just 2 threads.

It hangs after printing "Merging reads 100%", which indicates that it has processed all the data but does not realize that it is fully finished. I am pretty sure it has to do with the thread synchronisation around lines 1100-1300 of mergepairs.cc; perhaps a deadlock. The code there is somewhat complex, as threads are used to read, write, and process data in chunks.

torognes commented 2 years ago

The hang seems to happen when the number of input read pairs is a multiple of 500. This is the chunk size, the number of reads processed at a time, and there is a special case in the code when the last set of sequences exactly fills this limit.
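That special case is easy to see in miniature. A toy sketch (not the vsearch code; it only illustrates why an exact multiple of the chunk size is the dangerous input): a reader that treats "short final chunk" as its end-of-input signal never observes a short chunk when the total is an exact multiple of 500, so termination must be handled as an explicit special case.

```python
CHUNK = 500  # reads processed per chunk, per the comment above

def chunk_sizes(n_reads):
    """Split n_reads into successive chunks of at most CHUNK reads."""
    sizes, left = [], n_reads
    while left > 0:
        take = min(CHUNK, left)
        sizes.append(take)
        left -= take
    return sizes

# When the input is an exact multiple of CHUNK, every chunk is full,
# so a "last chunk is short" end-of-input test never fires.
for n in (166_000, 166_001):
    short_tail = chunk_sizes(n)[-1] < CHUNK
    print(n, "short final chunk:", short_tail)
```

In a multithreaded pipeline, a consumer waiting for that never-fired end-of-input condition simply blocks forever, which is consistent with the observed hang after "Merging reads 100%".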

fjossandon commented 2 years ago

Hi @torognes , very interesting. I'm glad that you were able to reproduce it and narrow down to the affected code. It seems you are getting closer to the solution.

fjossandon commented 2 years ago

By the way, I mentioned that I found a second case where this happened, although that case had 8 reads missing at the end of the incomplete file instead of 1. The test case I gave had 166,000 reads, and I just checked: the second case had 124,500, also a multiple of 500, so I think that supports your observation too.

torognes commented 2 years ago

Fixed in commit 76c7d5560040596629ef1041640f255260a44d6f, I think. I had forgotten to test a rare condition that can occur due to the timing of reading and writing files.

torognes commented 2 years ago

@fjossandon , could you see if the problem is still there? Or should I make a new release (2.20.1) for you first?

torognes commented 2 years ago

Fixed in version 2.20.1, just released.

fjossandon commented 2 years ago

Hello @torognes , I downloaded the new patched version and executed my script several times and there was no hang. It seems that the problem is gone. Now I can remove the test files from Drive since they are no longer needed.

Thanks for fixing it quickly! =)

torognes commented 2 years ago

Great! Thanks for testing!