torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

segmentation fault error when using swarm with -d 1 #109

Closed. mariabernard closed this issue 6 years ago.

mariabernard commented 6 years ago

Hello everyone,

We use Swarm in the FROGS pipeline to cluster amplicon sequences. For one particular dataset, however, swarm (2.1.13 or 2.2.0) returns a segmentation fault that I cannot explain.

It appears only when using -d 1 on the complete dataset. With -d 2, or with -d 1 on a subset of the file, swarm runs without error.

grep -c '>' full_dataset.fa
28751
swarm --differences 2 --threads 1 --log swarm_test_log.txt --output-file test.compo full_dataset.fa 
swarm --differences 1 --threads 1 --log swarm_test_log.txt --output-file test.compo full_dataset.fa 
Segmentation fault (core dumped)

head -n 30000 full_dataset.fa > test.fa
grep -c '>' test.fa 
15000
swarm --differences 2 --threads 1 --log swarm_test_log.txt --output-file test.compo test.fa 
swarm --differences 1 --threads 1 --log swarm_test_log.txt --output-file test.compo test.fa 

tail -n 30000 full_dataset.fa > test.fa 
grep -c '>' test.fa 
15000 # some sequences are common with the previous test
swarm --differences 2 --threads 1 --log swarm_test_log.txt --output-file test.compo test.fa 
swarm --differences 1 --threads 1 --log swarm_test_log.txt --output-file test.compo test.fa 

The particularity of this dataset is that sequence lengths vary widely, from 70 to 620 nt, and some sequences contain an artificial stretch of 100 A's.

Any idea what is causing this error?
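To check a length spread like the one described here, a small shell sketch can report the shortest and longest sequence. It assumes a two-line FASTA layout (one header line and one sequence line per record), which is consistent with 30000 lines yielding 15000 sequences in the session above. `length_range` is a hypothetical helper name, not part of swarm or FROGS.

```shell
# Hypothetical helper: print "<min> <max>" sequence length for a
# two-line FASTA stream read from standard input.
length_range() {
    awk '
        # Skip header lines and blank lines; every other line is one sequence.
        !/^>/ && NF {
            n = length($0)
            if (min == "" || n < min) min = n
            if (n > max) max = n
        }
        END { print min, max }
    '
}
```

It would be called as `length_range < full_dataset.fa`. For the dataset described in this report, the expected answer would be close to `70 620`.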

Best regards

Maria Bernard

torognes commented 6 years ago

Thank you for reporting this problem. It is difficult to see what the cause of the problem is. Could you please provide the output of the swarm_test_log.txt file for the case where a segmentation error occurs? If it is possible to provide the entire dataset that would be very helpful.

mariabernard commented 6 years ago

The swarm log is empty.

I have sent you an email with a link to download the dataset.

Thank you for your help.

Maria

torognes commented 6 years ago

Thanks, I have received the file and am trying to identify the bug.

torognes commented 6 years ago

I get the same error on Linux, but it appears to run fine on my Mac. Looks like a memory allocation error.

torognes commented 6 years ago

There was an embarrassing memory allocation error in the code for d=1 resulting in too little memory allocated in cases where there was large variability in the length of sequences.

The bug has been fixed in Swarm version 2.2.1 just released.

Thanks again for reporting this issue!

torognes commented 6 years ago

The crash was actually caused by a memory allocation bug that shows up in some cases when the input sequences are not dereplicated. I do not think it would appear with dereplicated input.
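For readers unfamiliar with dereplication, here is a minimal sketch of the idea: identical sequences are merged into a single record whose abundance is the sum of the merged abundances. It assumes a two-line FASTA stream whose headers end in `_<abundance>`, as in the `>s1_1` headers used elsewhere in this thread. `derep` is a hypothetical helper written for illustration, not a swarm command, and a dedicated dereplication tool is probably a better choice for real datasets.

```shell
# Hypothetical helper: merge identical sequences from a two-line FASTA
# stream on standard input. Headers are assumed to end in "_<abundance>".
derep() {
    awk '
        /^>/ { header = substr($0, 2); next }
        NF {
            seq = toupper($0)
            n = header
            sub(/.*_/, "", n)               # abundance: text after the last "_"
            if (!(seq in total)) {
                id = header
                sub(/_[0-9]+$/, "", id)     # identifier without its abundance
                first[seq] = id             # keep the first identifier seen
            }
            total[seq] += n
        }
        END {
            # Output order is unspecified; sort afterwards if order matters.
            for (seq in total)
                printf ">%s_%d\n%s\n", first[seq], total[seq], seq
        }
    '
}
```

Applied to 24 copies of the sequence `A`, each with abundance 1, this emits a single record `>s1_24` followed by `A`.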

frederic-mahe commented 6 years ago

Hi, I've created a unit test for that specific bug (see swarm-tests). With undereplicated length-1 input sequences, we need 1 seed + 2 * (7 + 4) + 1 = 1 + 22 + 1 = 24 sequences to trigger the bug.

for ((i=1 ; i<=23 ; i++)) ; do  printf ">s%d_1\nA\n" ${i} ; done | swarm &> /dev/null
# OK
for ((i=1 ; i<=24 ; i++)) ; do  printf ">s%d_1\nA\n" ${i} ; done | swarm &> /dev/null
# segfault

When we tested swarm's handling of undereplicated sequences, all the tests we wrote used longer input sequences. The array was then much bigger and there were not enough sequence matches to trigger the bug :-(
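One plausible reading of the (7 + 4) term is that it counts the distinct sequences one edit away from a sequence of length L = 1, using the upper bound 7L + 4 (3L substitutions, at most L deletions, at most 3L + 4 distinct insertions). That interpretation is an assumption and is not stated in this thread. The sketch below only verifies the count by brute-force enumeration over the alphabet ACGT; `count_microvariants` is a hypothetical helper and says nothing about how swarm sizes its internal arrays.

```shell
# Hypothetical helper: count the distinct sequences that differ from $1
# by exactly one substitution, deletion, or insertion (alphabet ACGT).
count_microvariants() {
    awk -v s="$1" 'BEGIN {
        n = split("A C G T", nt, " ")
        len = length(s)
        for (i = 1; i <= len; i++) {
            left = substr(s, 1, i - 1)
            right = substr(s, i + 1)
            seen[left right] = 1                  # delete position i
            for (k = 1; k <= n; k++)
                if (nt[k] != substr(s, i, 1))
                    seen[left nt[k] right] = 1    # substitute position i
        }
        for (i = 0; i <= len; i++)
            for (k = 1; k <= n; k++)              # insert after position i
                seen[substr(s, 1, i) nt[k] substr(s, i + 1)] = 1
        count = 0
        for (v in seen) count++
        print count
    }'
}
```

`count_microvariants A` prints 11, which matches 7 * 1 + 4, and `count_microvariants ACGT` prints 32, which matches 7 * 4 + 4. A homopolymer such as `AA` gives 17, below the bound of 18, because deleting either A yields the same sequence.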

mariabernard commented 6 years ago

Many thanks!

I added a dereplication step and, even with swarm 2.2.0, it works!