torognes / vsearch

Versatile open-source tool for microbiome analysis

fastq_mergepairs randomly fails to finish when called in parallel #469

Closed fjossandon closed 2 years ago

fjossandon commented 2 years ago

Hello, I frequently use several Vsearch functions, and along the way I've noticed that the --fastq_mergepairs option sometimes just hangs: the process doesn't finish even after several hours, while the CPU sits idle. I have not seen this happen with other options like --uchime_denovo or --usearch_global. Killing the process and launching the command again usually works fine, so it's hard to reproduce. It's also worth mentioning that the command is usually executed in multiprocess mode.

I tested Vsearch using the latest version:

fossandon@ubuntu:~/Documents/$ python3 -V
Python 3.6.9

fossandon@ubuntu:~/Documents/$ vsearch -v
vsearch v2.19.0_linux_x86_64, 15.6GB RAM, 4 cores
https://github.com/torognes/vsearch

Rognes T, Flouri T, Nichols B, Quince C, Mahe F (2016)
VSEARCH: a versatile open source tool for metagenomics
PeerJ 4:e2584 doi: 10.7717/peerj.2584 https://doi.org/10.7717/peerj.2584

Compiled with support for gzip-compressed files, and the library is loaded.
zlib version 1.2.11, compile flags a9
Compiled with support for bzip2-compressed files, and the library is loaded.

I recently had some time to search for a test case and put a small script to reproduce it, so I put it in Drive because the files are a little big: https://drive.google.com/drive/folders/1GEfmai3TarXrSBchLET68VamZ_IbaA46?usp=sharing

The compressed file contains a test case for which I originally ran the fastq_mergepairs command with fastqout first to get the merged reads, and then again with fastqout_notmerged_fwd instead to get another file containing only the forward sequences of the read pairs that could not be merged. The second (unmerged) command hung in this case:

vsearch --fastq_mergepairs input_119148_R1.fastq --reverse input_119148_R2.fastq --fastq_allowmergestagger --fastq_minlen 100 --fastq_minovlen 100 --fastq_minmergelen 100 --fastq_maxns 20 --fastq_maxdiffs 50 --fastq_eeout --fastqout_notmerged_fwd 119148_unmerged.fastq
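Since killing and relaunching usually unblocks a hung run (as noted above), one practical workaround while the bug exists is to wrap the call with a timeout and retry. A minimal sketch in Python (the timeout and retry count are arbitrary choices, not from the original script; the command is the one above):

```python
import subprocess

CMD = [
    "vsearch", "--fastq_mergepairs", "input_119148_R1.fastq",
    "--reverse", "input_119148_R2.fastq",
    "--fastq_allowmergestagger",
    "--fastq_minlen", "100", "--fastq_minovlen", "100",
    "--fastq_minmergelen", "100", "--fastq_maxns", "20",
    "--fastq_maxdiffs", "50", "--fastq_eeout",
    "--fastqout_notmerged_fwd", "119148_unmerged.fastq",
]

def run_with_retry(cmd, timeout_s=600, retries=3):
    """Run cmd; if it exceeds timeout_s, kill it and relaunch."""
    for attempt in range(retries):
        try:
            return subprocess.run(cmd, timeout=timeout_s, check=True)
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt + 1} hung, retrying")
    raise RuntimeError("command hung on every attempt")
```

`subprocess.run(..., timeout=...)` kills the child process on expiry, which matches the manual kill-and-relaunch behaviour described above.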

The compressed file contains the following files:

I want to note that although the test script reuses around 10 different output files with the same input during execution to save space, in the original case all the parallel processes were handling different inputs and outputs.

I executed the sequential script a couple of times and Vsearch did not hang; a normal Vsearch execution only takes around 1 second per iteration. On the other hand, while executing the parallel script I found that 2 processes hung out of the 1000 iterations. I was using a limit of 3 active processes, so the script was able to reach the last iteration with the 3rd one, but could not reach the last print because of the 2 that did not close. The output looks like this:

launching up to 3 parallel processes
2 started
3 started
1 started
Now will wait for all processes to finish
2  finished!
4 started
3  finished!
...................................
999 started
999  finished!
1000 started
1000  finished!

The script cannot print the last "Finished!" message and stays like that unless killed.
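For reference, the parallel driver can be sketched roughly like this (a reconstruction from the description above, not the exact script from Drive; the pool size of 3 and the ~10 reused output names follow the text):

```python
import subprocess
from multiprocessing import Pool

def build_cmd(i):
    # Reuse ~10 output file names across the 1000 iterations to save space.
    out = f"test_{i % 10}_unmerged.fastq"
    return ["vsearch", "--fastq_mergepairs", "input_119148_R1.fastq",
            "--reverse", "input_119148_R2.fastq",
            "--fastq_allowmergestagger",
            "--fastq_minlen", "100", "--fastq_minovlen", "100",
            "--fastq_minmergelen", "100", "--fastq_maxns", "20",
            "--fastq_maxdiffs", "50", "--fastq_eeout",
            "--fastqout_notmerged_fwd", out]

def run_one(i):
    print(i + 1, "started", flush=True)
    subprocess.run(build_cmd(i), check=True)
    print(i + 1, " finished!", flush=True)

if __name__ == "__main__":
    print("launching up to 3 parallel processes")
    with Pool(3) as pool:
        pool.map(run_one, range(1000))
    print("Finished!")
```

With this structure, a hang in any single vsearch invocation blocks `pool.map`, which is why the driver above never prints its final "Finished!" line when two children fail to exit.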

In this last iteration, if I execute ps ax I can see that there are 2 vsearch processes still alive:

fossandon@ubuntu:~/Documents/$ ps ax | grep "vsearch"
 60273 pts/11   Sl+    0:17 python3 vsearch_test_parallel.py
 60275 pts/11   S+     0:01 python3 vsearch_test_parallel.py
 60276 pts/11   S+     0:00 python3 vsearch_test_parallel.py
 60277 pts/11   S+     0:00 python3 vsearch_test_parallel.py
 60540 pts/11   Sl+    0:02 vsearch --fastq_mergepairs input_119148_R1.fastq --reverse input_119148_R2.fastq --fastq_allowmergestagger --fastq_minlen 100 --fastq_minovlen 100 --fastq_minmergelen 100 --fastq_maxns 20 --fastq_maxdiffs 50 --fastq_eeout --fastqout_notmerged_fwd test_0_unmerged.fastq
 60565 pts/11   Sl+    0:02 vsearch --fastq_mergepairs input_119148_R1.fastq --reverse input_119148_R2.fastq --fastq_allowmergestagger --fastq_minlen 100 --fastq_minovlen 100 --fastq_minmergelen 100 --fastq_maxns 20 --fastq_maxdiffs 50 --fastq_eeout --fastqout_notmerged_fwd test_5_unmerged.fastq
 68322 pts/12   S+     0:00 grep --color=auto vsearch

In summary, this apparently happens only when multiprocessing, and randomly, so several repetitions are needed to trigger it.

I have no clue why it happens, but it seems specific to fastq_mergepairs. I hope you can reproduce it too with the scripts.

Best regards,

torognes commented 2 years ago

Thanks for reporting this issue and the very detailed description. I'll look into the issue soon.

frederic-mahe commented 2 years ago

Very nice bug report @fjossandon ! This is probably similar to the elusive issue #439

I did my best in March 2021 to reproduce and analyze what I believe to be a multithreading issue, but without any success. I used rr with its option rr record --chaos. According to its documentation, rr is good at finding multithreading bugs by artificially starving one thread.

torognes commented 2 years ago

I have been able to reproduce the problem by running vsearch 2.20.0 repeatedly with your input files, at least under the following circumstances:

Usually the hang occurred before 100 runs were done. It seems to happen more often with just 2 threads.

It hangs after printing "Merging reads 100%", which indicates that it has processed all the data but does not realize that it is fully finished. I am pretty sure it has to do with the thread synchronisation around lines 1100-1300 of mergepairs.cc; perhaps a deadlock. The code there is somewhat complex, as threads are used to read, write, and process data in chunks.

torognes commented 2 years ago

The hang seems to happen when the number of input read pairs is a multiple of 500. This is the chunk size, the number of reads processed at a time, and there is a special case in the code when the last set of sequences exactly fills this limit.
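That special case is easy to see in miniature. A toy sketch (not the vsearch code; it only illustrates why an exact multiple of the chunk size is the dangerous input): a reader that treats "short final chunk" as its end-of-input signal never observes a short chunk when the total is an exact multiple of 500, so termination must be handled as an explicit special case.

```python
CHUNK = 500  # reads processed per chunk, per the comment above

def chunk_sizes(n_reads):
    """Split n_reads into successive chunks of at most CHUNK reads."""
    sizes, left = [], n_reads
    while left > 0:
        take = min(CHUNK, left)
        sizes.append(take)
        left -= take
    return sizes

# When the input is an exact multiple of CHUNK, every chunk is full,
# so a "last chunk is short" end-of-input test never fires.
for n in (166_000, 166_001):
    short_tail = chunk_sizes(n)[-1] < CHUNK
    print(n, "short final chunk:", short_tail)
```

In a multithreaded pipeline, a consumer waiting for that never-fired end-of-input condition simply blocks forever, which is consistent with the observed hang after "Merging reads 100%".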

fjossandon commented 2 years ago

Hi @torognes , very interesting. I'm glad that you were able to reproduce it and narrow down to the affected code. It seems you are getting closer to the solution.

fjossandon commented 2 years ago

By the way, I mentioned that I found a second case where this happened, although that case had 8 reads missing at the end of the incomplete file instead of 1. The test case I gave had 166,000 reads, and I just checked: the second case had 124,500, also a multiple of 500, so I think that supports your observation too.

torognes commented 2 years ago

Fixed in commit 76c7d5560040596629ef1041640f255260a44d6f, I think. I had forgotten to test a rare condition that can occur due to the timing of reading and writing files.

torognes commented 2 years ago

@fjossandon , could you see if the problem is still there? Or should I make a new release (2.20.1) for you first?

torognes commented 2 years ago

Fixed in version 2.20.1, just released.

fjossandon commented 2 years ago

Hello @torognes , I downloaded the new patched version and executed my script several times and there was no hang. It seems that the problem is gone. Now I can remove the test files from Drive since they are no longer needed.

Thanks for fixing it quickly! =)

torognes commented 2 years ago

Great! Thanks for testing!