Closed: fjossandon closed this issue 2 years ago
Thanks for reporting this issue and the very detailed description. I'll look into the issue soon.
Very nice bug report @fjossandon ! This is probably similar to the elusive issue #439
I did my best in March 2021 to reproduce and analyze what I think is a multithreading issue, but without any success. I used rr with its option rr record --chaos. According to its documentation, rr is good for finding multithreading bugs by artificially starving one thread.
I have been able to reproduce the problem by running vsearch 2.20.0 repeatedly with your input files at least under the following circumstances:
Usually the hang occurred before 100 runs were done. It seems to happen more often with just 2 threads.
It hangs after printing "Merging reads 100%", which indicates that it has processed all data but does not realize that it is fully finished. I am pretty sure it has to do with the thread synchronisation on lines 1100-1300 in mergepairs.cc, perhaps a deadlock. The code there is somewhat complex, as threads are used to read, write, and process data in chunks.
The hang seems to happen when the number of input read pairs is a multiple of 500. This is the chunk size, i.e. the number of reads processed at a time, and there is a special case in the code when the last set of sequences exactly fills this limit.
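The handoff described above can be sketched generically. This is a hypothetical illustration of the pattern (a condition-variable handoff of 500-read chunks between a reader thread and a processing thread), not vsearch's actual mergepairs.cc code; the names `Pipeline`, `reader`, and `run_pipeline` are invented for the sketch. The point it shows is that the processing thread must be signalled once more after the end-of-input flag is set, otherwise it can wait forever precisely when the last chunk exactly fills the 500-read limit.

```cpp
#include <algorithm>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

constexpr int CHUNK = 500;  // reads handed over per chunk

struct Pipeline {
  std::mutex m;
  std::condition_variable cv;
  std::queue<int> chunks;   // sizes of chunks ready for processing
  bool done = false;        // set by the reader when input is exhausted
};

// Reader thread: queues fixed-size chunks, then announces end of input.
static void reader(Pipeline& p, int total_reads) {
  for (int remaining = total_reads; remaining > 0;) {
    int n = std::min(CHUNK, remaining);
    remaining -= n;
    {
      std::lock_guard<std::mutex> lk(p.m);
      p.chunks.push(n);
    }
    p.cv.notify_one();
  }
  {
    std::lock_guard<std::mutex> lk(p.m);
    p.done = true;      // set the flag under the lock...
  }
  p.cv.notify_one();    // ...and wake the processor one final time,
                        // even if the last chunk was exactly CHUNK reads
}

// Processing thread: drains chunks until the reader reports completion.
// Returns the number of reads processed, which equals total_reads when
// the shutdown handshake is correct, including multiples of CHUNK.
int run_pipeline(int total_reads) {
  Pipeline p;
  std::thread t(reader, std::ref(p), total_reads);
  int processed = 0;
  for (;;) {
    std::unique_lock<std::mutex> lk(p.m);
    p.cv.wait(lk, [&] { return !p.chunks.empty() || p.done; });
    if (p.chunks.empty()) break;  // done and nothing left: clean exit
    processed += p.chunks.front();
    p.chunks.pop();
  }
  t.join();
  return processed;
}
```

Dropping the final `notify_one()` after setting `done` is the class of bug that produces exactly the observed symptom: all data is processed, yet the waiting thread never learns the pipeline is finished.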
Hi @torognes , very interesting. I'm glad that you were able to reproduce it and narrow down to the affected code. It seems you are getting closer to the solution.
By the way, I mentioned that I found a second case where this happened, although that case had 8 reads missing at the end of the incomplete file instead of 1. The test case I gave had 166,000 reads, and I just checked: the second case had 124,500, also a multiple of 500, so I think that supports your observation too.
Fixed in commit 76c7d5560040596629ef1041640f255260a44d6f, I think. I forgot to test a rare condition that can happen due to the timing of reading and writing files.
@fjossandon , could you see if the problem is still there? Or should I make a new release (2.20.1) for you first?
Hello @torognes , I downloaded the new patched version and executed my script several times and there was no hang. It seems that the problem is gone. Now I can remove the test files from Drive since they are no longer needed.
Thanks for fixing it quickly! =)
Great! Thanks for testing!
Hello, I frequently use several functions of Vsearch, and along the way I've noticed that the --fastq_mergepairs option sometimes just hangs: the process doesn't finish even after several hours, while the CPU is doing nothing. I have not seen this happen with other options like --uchime_denovo or --usearch_global. Just killing the process and launching the command again usually works fine, so it's hard to reproduce. It's also worth mentioning that the command is usually executed in multiple parallel processes. I tested Vsearch using the latest version:
I recently had some time to search for a test case and put a small script to reproduce it, so I put it in Drive because the files are a little big: https://drive.google.com/drive/folders/1GEfmai3TarXrSBchLET68VamZ_IbaA46?usp=sharing
The compressed file contains a test case. I originally ran the fastq_mergepairs command with fastqout first to get the merged reads, and then again with fastqout_notmerged_fwd instead to get another file with only the forward sequences of the read pairs that could not be merged; it was this unmerged command that hung in this case:
The compressed file contains the following files:
I want to note that although the test script reuses around 10 different output files with the same input during execution to save space, in the original case all the parallel processes were handling different inputs and outputs.
I executed the sequential script a couple of times and Vsearch did not hang; each normal Vsearch iteration only takes around 1 second. On the other hand, while executing the parallel script I found that 2 processes hung out of the 1000 iterations. I was using a limit of 3 active processes, so the script was able to reach the last iteration with the 3rd one, but could not reach the last print because of the 2 that did not close. The output looks like this:
The script cannot print the last "Finished!" message and stays like that unless killed.
In this last iteration, if I execute ps ax I can see that there were 2 processes still alive. In summary, this apparently happens when multiprocessing, and randomly, so several repetitions are needed to trigger it.
I have no clue why it happens, but it seems specific to fastq_mergepairs. I hope you can reproduce it too with the scripts.
Best regards,