torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

maximal header length #171

Closed frederic-mahe closed 1 year ago

frederic-mahe commented 2 years ago

Header length used to be limited to 2,048 characters. There is no specific limit imposed on header lengths anymore. Yet, a long header can trigger a segmentation fault on my machine (SIGSEGV or signal 11 on Linux):

cd /tmp/

export LC_ALL=C

# make a fasta file (one entry) with a very long header
# (the second assignment overrides the first; comment it out to run the passing case)
LENGTH=16777213  # success: 35,687,712 kB of RAM
LENGTH=16777214  # failure: Command terminated by signal 11, 2,133,464 kB of RAM

FASTA="tmp_${LENGTH}.fas"
(
    printf ">"
    yes A | head -n "${LENGTH}" | tr -d "\n"
    printf "_1\nA\n"
) > "${FASTA}"

/usr/bin/time swarm --output /dev/null "${FASTA}"

rm "${FASTA}"

(Note the large amount of memory used: 35 GB)

The maximum header line length before failure is 16,777,213 (the As) + 1 (the ">") + 2 (the "_1") = 16,777,216 = 2^24 characters. Maybe we could add a check and a call to fatal() if the header length is greater than that?
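In the meantime, a pre-flight check on the input file could catch such headers before swarm is run. The following is only a sketch: `check_header_length` is a hypothetical helper, not part of swarm, and the default limit of 2^24 characters (counting the leading ">") is an assumption taken from the test above, not from swarm's source code.

```shell
# Hypothetical helper (not part of swarm): report FASTA header lines that
# are longer than a given limit and return a non-zero status if any is found.
# Usage: check_header_length FASTA [MAX_LENGTH]
# The default limit (2^24, leading ">" included) is an assumption based on
# the largest header line that passed in the test above.
check_header_length() {
    fasta="$1"
    max_length="${2:-16777216}"
    # LC_ALL=C so that length() counts bytes rather than multibyte characters
    LC_ALL=C awk -v max="${max_length}" '
        BEGIN { too_long = 0 }
        /^>/ && length($0) > max {
            printf "line %d: header of %d characters exceeds %d\n", NR, length($0), max
            too_long = 1
        }
        END { exit too_long }
    ' "${fasta}"
}
```

For example, `check_header_length "${FASTA}" && swarm --output /dev/null "${FASTA}"` would only launch swarm when every header line fits within the assumed limit.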

frederic-mahe commented 1 year ago

Fixed by commit 97beb00024addbebc3c9be80499ebd87a424fca3

Feel free to re-open if need be.