torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

Segmentation fault (or SIGABRT) with -a #35

Closed frederic-mahe closed 9 years ago

frederic-mahe commented 9 years ago

With several test datasets, I get a segmentation fault at the end of the clustering process when using the -a option. A gdb session shows:

gdb swarm
...
run -a -b -d 1 < AF091148.fas
...
Number of swarms:  887
Largest swarm:     441
Max generations:   186

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff67f3700 (LWP 24981)]
0x00007ffff702d0c0 in __GI___call_tls_dtors () at cxa_thread_atexit_impl.c:83
83  cxa_thread_atexit_impl.c: No such file or directory.
(gdb) backtrace 
#0  0x00007ffff702d0c0 in __GI___call_tls_dtors () at cxa_thread_atexit_impl.c:83
#1  0x00007ffff7bc70b2 in start_thread (arg=0x7ffff67f3700) at pthread_create.c:319
#2  0x00007ffff70dac2d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

If I do the same using 2 threads, I get a different error:

gdb swarm
...
run -a -b -d 1 < AF091148.fas
...
[Thread 0x7ffff5ff2700 (LWP 25025) exited]
*** Error in `/home/fred/Science/Projects/Swarms/swarm/swarm': munmap_chunk(): invalid pointer: 0x00000000006403d0 ***
[Thread 0x7ffff57f1700 (LWP 25026) exited]

Program received signal SIGABRT, Aborted.
0x00007ffff702a077 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) backtrace 
#0  0x00007ffff702a077 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff702b458 in __GI_abort () at abort.c:89
#2  0x00007ffff7067fb4 in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7ffff715abc0 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffff706d78e in malloc_printerr (action=1, str=0x7ffff715abe8 "munmap_chunk(): invalid pointer", ptr=<optimized out>) at malloc.c:4996
#4  0x000000000040b76c in hash_free () at algod1.cc:142
#5  algo_d1_run () at algod1.cc:727
#6  0x000000000040169a in main (argc=<optimized out>, argv=0x7fffffffe2e8) at swarm.cc:428
(gdb) frame 4
#4  0x000000000040b76c in hash_free () at algod1.cc:142
142   free(hash_occupied);
(gdb) print hash_occupied 
$1 = (unsigned char *) 0x6403d0 "J\001"

torognes commented 9 years ago

Hm. Which version of SWARM is this?

frederic-mahe commented 9 years ago

The latest one (Linux 1.2.16), fresh from the GitHub repository. I recompiled (make) to get the debug symbols. The official binary and the recompiled one produce the same error. I don't know about the Mac binary (I don't have a Mac).

torognes commented 9 years ago

That was embarrassing. Too little memory was allocated for some arrays used to store temporary results. Too little testing too. For some reason I did not get a segfault on the Mac and it worked fine with multiple threads on the dataset I used for testing. Fixed now.
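
The actual fix is not shown in this thread. As a hypothetical illustration of the class of bug described above (a temporary array allocated too small, so that writes run past the end of the heap block and a later free() aborts or the program segfaults), here is a minimal C++ sketch of a scratch buffer that is sized from a worst-case bound and checks every write. The names `ScratchBuffer` and `worst_case` are invented for this example and do not come from swarm's source.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical scratch buffer (not swarm's code): the capacity is fixed up
// front from a caller-supplied worst-case bound, and every write is
// bounds-checked instead of silently overrunning the allocation.
class ScratchBuffer {
public:
    explicit ScratchBuffer(std::size_t worst_case)
        : data_(worst_case), used_(0) {}

    // Append one temporary result; fail loudly if the bound was too small.
    void push(unsigned int value) {
        if (used_ >= data_.size())
            throw std::length_error("scratch buffer too small");
        data_[used_++] = value;
    }

    std::size_t size() const { return used_; }
    std::size_t capacity() const { return data_.size(); }

    // Reuse the same allocation for the next batch of results.
    void clear() { used_ = 0; }

private:
    std::vector<unsigned int> data_;  // storage is released automatically
    std::size_t used_;                // number of slots currently filled
};
```

With a check like this, an undersized bound would likely surface as an exception at the offending write on every platform, rather than as heap corruption that only shows up at free() on some systems and thread counts.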