torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

bug when using several threads #74

Closed frederic-mahe closed 8 years ago

frederic-mahe commented 8 years ago

Richard Christen reported a segmentation fault on a big dataset. While investigating it, I found the following bug:

cat test.fas 
>a1_1
ACGT
>a2_1
ACGT
>b1_1
TGCA
for ((i=1 ; i<=30 ; i++)) ; do
    echo -ne "$i\t"
    swarm -t $i < test.fas 2> /dev/null | wc -l
done

The expected result is 2 OTUs for any -t value, but with -t 6 or greater swarm returns 3 OTUs.
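To spot the affected -t values without reading the whole table, the loop's output can be filtered. The sketch below is only an illustration: `unexpected_counts` is a hypothetical helper (not part of swarm) that reads whitespace-separated "threads, count" lines on standard input and prints each thread count whose OTU count differs from the expected value given as its first argument.

```shell
# Hypothetical helper, not part of swarm.
# stdin: lines of "<threads> <OTU count>" (tab or space separated)
# $1:    expected OTU count
# Prints the thread counts whose OTU count differs from the expected one.
unexpected_counts() {
    awk -v expected="$1" '$2 != expected { print $1 }'
}
```

Piping the output of the loop above into `unexpected_counts 2` would then list only the -t values that do not yield 2 OTUs.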

frederic-mahe commented 8 years ago

Changing the test file to:

>a1_1
ACGT
>a2_1
ACGT
>b1_1
TGCA
>c1_1
AAAA
>c2_1
AAAA

yields 6 OTUs instead of 3 when -t is greater than 5 (on my 2-core laptop and on a 16-core node). There is no problem when using d > 1.

torognes commented 8 years ago

I am looking into the problem now. On my Mac I get exactly the same results as you do. Not good.

torognes commented 8 years ago

I have found the source of the bug and a solution. It affects only cases where the input sequences have not been properly dereplicated, i.e. when the input contains two or more identical copies of some sequences. It also appears only when the sequences (or any of their microvariants) are shorter than the number of threads minus one. Both conditions hold for both of the examples. I'll fix it and release a new version soon.
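The two conditions described above can be checked on an input file before clustering. This is a minimal sketch, not swarm's own logic: `duplicated_sequences` and `shortest_length` are hypothetical helpers, and both assume that every sequence sits on a single line, as in the test files of this thread.

```shell
# Hypothetical helpers, not part of swarm.
# Both assume a FASTA file with one-line sequences.

# Print every sequence that occurs more than once (i.e. is not dereplicated).
duplicated_sequences() {
    grep -v '^>' "$1" | sort | uniq -d
}

# Print the length of the shortest sequence in the file.
shortest_length() {
    grep -v '^>' "$1" |
        awk 'NF { n = length($0); if (min == "" || n < min) min = n }
             END { print min }'
}
```

On the first test file, `duplicated_sequences test.fas` would report ACGT and `shortest_length test.fas` would report 4. Reading "shorter than the number of threads minus one" as length < t - 1, the condition 4 < t - 1 holds from -t 6 upward, which matches the threshold reported above.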

frederic-mahe commented 8 years ago

Pfffew! So that's a very input-specific bug. What a relief!

The bug-fix release description should be something like this: "version 2.1.8 fixes a rare bug triggered when clustering extremely short undereplicated sequences."

torognes commented 8 years ago

Fixed in 2.1.8.