torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

Swarm does not start clustering in rare cases #110

Closed torognes closed 6 years ago

torognes commented 6 years ago

In a rare case, when using d=5 and 10 threads, clustering did not start: Swarm sat waiting at 0% progress. It worked fine on the next run, so the problem is not easily reproducible. It may be related to issue #78 and may have to do with multi-threading.

frederic-mahe commented 6 years ago

I can replicate the bug. If I monitor it with valgrind, and kill the process after a while:

valgrind swarm -d 5 -t 8 -z ../../../../examples/AF091148.fas -o /dev/null
==19227== Memcheck, a memory error detector
==19227== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==19227== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
==19227== Command: swarm -d 5 -t 8 -z ../../../../examples/AF091148.fas -o /dev/null
==19227==
Swarm 2.2.1 [Oct 28 2017 10:00:41]
Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
https://github.com/torognes/swarm

Mahe F, Rognes T, Quince C, de Vargas C, Dunthorn M (2014)
Swarm: robust and fast clustering method for amplicon-based studies
PeerJ 2:e593 https://doi.org/10.7717/peerj.593

Mahe F, Rognes T, Quince C, de Vargas C, Dunthorn M (2015)
Swarm v2: highly-scalable and high-resolution amplicon clustering
PeerJ 3:e1420 https://doi.org/10.7717/peerj.1420

CPU features:      mmx sse sse2 sse3 ssse3 sse4.1 sse4.2 popcnt avx avx2
Database file:     ../../../../examples/AF091148.fas
Output file:       /dev/null
Resolution (d):    5
Threads:           8
Scores:            match: 5, mismatch: 4
Gap penalties:     opening: 12, extension: 4
Converted costs:   mismatch: 18, gap opening: 24, gap extension: 13
Break OTUs:        Yes
Fastidious:        No

Reading database:  100%
Indexing database: 100%
Database info:     180704 nt in 1403 sequences, longest 137 nt
Find qgram vects:  100%
Clustering:        0%

^C==19227==
==19227== Process terminating with default action of signal 2 (SIGINT)
==19227==    at 0x4E4515F: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.S:185)
==19227==    by 0x118C12: qgram_diff_fast(unsigned long, unsigned long, unsigned long*, unsigned long*) (qgram.cc:327)
==19227==    by 0x11316C: algo_run() (algo.cc:175)
==19227==    by 0x109BE0: main (swarm.cc:680)
==19227==
==19227== HEAP SUMMARY:
==19227==     in use at exit: 2,779,036 bytes in 84 blocks
==19227==   total heap usage: 11,560 allocs, 11,476 frees, 3,535,040 bytes allocated
==19227==
==19227== LEAK SUMMARY:
==19227==    definitely lost: 0 bytes in 0 blocks
==19227==    indirectly lost: 0 bytes in 0 blocks
==19227==      possibly lost: 4,608 bytes in 16 blocks
==19227==    still reachable: 2,774,428 bytes in 68 blocks
==19227==         suppressed: 0 bytes in 0 blocks
==19227== Rerun with --leak-check=full to see details of leaked memory
==19227==
==19227== For counts of detected and suppressed errors, rerun with: -v
==19227== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

The program seems to be stuck inside the qgram_diff_fast function, blocked in pthread_cond_wait (qgram.cc:327). Using valgrind -v provides instructions on how to debug that stuck process in gdb.

frederic-mahe commented 6 years ago

Another run gets stuck in qgram_diff_fast, at the same sub-function call:

pthread_cond_wait(&tip->workcond, &tip->workmutex);

torognes commented 6 years ago

Thanks. It clearly seems to be an issue with the pthread_cond_wait and associated pthread_cond_signal calls, which do not work well in all cases.
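For readers unfamiliar with this class of bug, here is a minimal sketch of the usual defensive pattern around pthread_cond_wait: the shared state is only changed while holding the mutex, and every wait sits in a loop that re-checks a predicate, so a signal sent before the other thread starts waiting is not lost. This is not Swarm's actual code or the actual fix; the struct worker_info, its fields, and the functions start_worker, run_job and stop_worker are hypothetical names chosen for illustration, and the "work" is a placeholder doubling of a number.

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical per-thread state (illustrative only, not Swarm's struct).
struct worker_info {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    int work;        // predicate: 1 = job pending, 0 = idle, -1 = shut down
    long input;
    long result;
    pthread_t thread;
};

static void* worker(void* arg) {
    worker_info* w = static_cast<worker_info*>(arg);
    pthread_mutex_lock(&w->mutex);
    for (;;) {
        // Re-check the predicate in a loop: this covers spurious wake-ups
        // and the case where the signal was sent before we started waiting.
        while (w->work == 0)
            pthread_cond_wait(&w->cond, &w->mutex);
        if (w->work < 0)
            break;                      // shutdown requested
        w->result = w->input * 2;       // placeholder for the real work
        w->work = 0;                    // state change made under the mutex
        pthread_cond_signal(&w->cond);  // tell the caller the job is done
    }
    pthread_mutex_unlock(&w->mutex);
    return nullptr;
}

void start_worker(worker_info* w) {
    pthread_mutex_init(&w->mutex, nullptr);
    pthread_cond_init(&w->cond, nullptr);
    w->work = 0;
    w->input = 0;
    w->result = 0;
    pthread_create(&w->thread, nullptr, worker, w);
}

// Hand one job to the worker and block until it has finished.
long run_job(worker_info* w, long input) {
    pthread_mutex_lock(&w->mutex);
    w->input = input;
    w->work = 1;
    pthread_cond_signal(&w->cond);
    while (w->work == 1)                // wait on the predicate, not the signal
        pthread_cond_wait(&w->cond, &w->mutex);
    long r = w->result;
    pthread_mutex_unlock(&w->mutex);
    return r;
}

void stop_worker(worker_info* w) {
    pthread_mutex_lock(&w->mutex);
    w->work = -1;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->mutex);
    pthread_join(w->thread, nullptr);
    pthread_cond_destroy(&w->cond);
    pthread_mutex_destroy(&w->mutex);
}
```

A caller would run start_worker once, call run_job repeatedly, and finish with stop_worker. If the predicate were instead tested or modified outside the mutex, or the wait were not in a loop, a signal could arrive while nobody is waiting and the waiter would then block forever, which is one common way to end up with a hang like the one reported here.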

frederic-mahe commented 6 years ago

That seems to be a deadlock. I've been trying to get more info by using valgrind --tool=helgrind but I cannot replicate the bug in that context.

torognes commented 6 years ago

I think it is fixed with the latest commit.

frederic-mahe commented 6 years ago

Do you think that would solve #78 too?

torognes commented 6 years ago

Yes.

frederic-mahe commented 6 years ago

The patch does not change any of the logic covered by the tests (all tests pass).

torognes commented 6 years ago

Fixed in Swarm 2.2.2, which was just released.