torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

documentation: maximal input sequence length #163

Closed frederic-mahe closed 2 years ago

frederic-mahe commented 3 years ago

(low priority)

LENGTH=$(( 100 * 1024 * 1024 ))
SEQ=$(yes A | head -n "${LENGTH}" | tr -d "\n")
printf ">s1_1\n%s\n>s2_1\n%s\n" "${SEQ}" "$(sed 's/A/T/1' <<< ${SEQ})" | swarm

swarm can process input sequences of 10 MB, but crashes with a segmentation fault on 100-MB input sequences (as generated above).

I would like to investigate why, and check whether that limitation could be easily lifted. If not, that's perfectly ok, but I would like to give a clear maximal sequence length value in the manual.

frederic-mahe commented 2 years ago

I've found the time to dig a little deeper:

export LC_ALL=C

# make fasta files (two entries) with very long sequences
for LENGTH in {67108861..67108862..1} ; do
    FASTA="tmp_${LENGTH}.fas"
    (
        printf ">s1_1\n"
        yes A | head -n "${LENGTH}" | tr -d "\n"
        printf "\n"
        printf ">s2_1\n"
        (printf "T\n" ; yes A) | head -n "${LENGTH}" | tr -d "\n"
        printf "\n"
    ) > "${FASTA}"
    swarm --output /dev/null "${FASTA}"
    rm "${FASTA}"
done

That threshold is a bit odd. I was expecting something in line with the usual max values for integers. What do you think @torognes ?

Having several 67,108,861-nt fasta entries is ok (tested with 2, 3 and 4 entries). I will update the documentation to indicate that swarm can process fasta entries of up to 67 Mb.

If I can, I'll try to run the test in a debugger to check what triggers the segmentation fault.
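
For instance (just a sketch: it assumes swarm was compiled with debugging symbols and that gdb is available), the failing file from the loop above could be kept and loaded into the debugger:

# keep the 67,108,862-nt fasta file from the loop above (skip the "rm"),
# then run swarm under gdb and get a backtrace after the crash
gdb --args swarm --output /dev/null tmp_67108862.fas
# at the (gdb) prompt: "run", then "backtrace" once the segfault is reported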

frederic-mahe commented 2 years ago

Note: the length limit is very close to 2^16 × 1024 = 2^26 = 67,108,864

torognes commented 2 years ago

Yes, the limit is 2^26 - 3 = 64MB - 3, where M=1024*1024.
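
A quick check in the shell, matching the longest working length observed above:

echo $(( 2 ** 26 ))        # 67108864
echo $(( 2 ** 26 - 3 ))    # 67108861, the longest length that still worked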

I can confirm that I get exactly the same limit on macOS. BTW it is eating up all memory (34GB or so)...

I'll also try to find out where it stops.

Maybe we should anyway set a hard limit at a reasonable length.

torognes commented 2 years ago

I think the problem is in the zobrist_init function in zobrist.cc. It is called from db_read in db.cc. This function allocates the arrays zobrist_tab_base, which is 32n bytes big, and zobrist_tab_byte_base, which is 512n bytes big. Here n is the length of the longest sequence + 2. Some of the calculations are done with unsigned ints (32 bits). I think the addressing of zobrist_tab_byte_base at the end of the function will fail when n is larger than 64M.
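
If so, a back-of-the-envelope check fits the observed threshold. Assuming the failing index counts 8-byte uint64_t entries of zobrist_tab_byte_base (512n bytes / 8 = 64n entries, with n = longest length + 2; this is an assumption), a 32-bit unsigned index would wrap exactly at the boundary seen above:

# hypothetical check: index of 64n entries vs the 2^32 limit of a 32-bit unsigned int
echo $(( 64 * (67108861 + 2) ))    # 4294967232, still below 2^32 (this length worked)
echo $(( 64 * (67108862 + 2) ))    # 4294967296 = 2^32, wraps to 0 (this length crashed)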

We could change the unsigned ints here to uint64_t to enable longer sequences. But I also think that we should consider setting a shorter fixed limit on the lengths, e.g. 1 Mbp or so, since the program will use too much memory and be so slow that it is not practical to use it with longer sequences.

frederic-mahe commented 2 years ago

Thanks for tracking down the cause of the issue. I did not realize the memory consumption was so high. In light of this, I suggest we do the following:

  1. I am going to deactivate the unit-test for extremely long sequences (too slow, too much memory needed),
  2. no need to change swarm's code to allow for even longer sequences (impractical, too much memory),
  3. I will update the documentation to indicate that processing 67-Mb sequences requires at least 32 gigabytes of RAM (see the estimate after this list),
  4. consequently, I don't think we need to put a 1-Mb cap on sequence lengths (let users process very long sequences if they want to, the hard limit is clearly documented)
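
For the record, a rough estimate of that figure (assuming the 512n-byte zobrist_tab_byte_base table described above dominates memory usage):

# memory needed for the byte table with a 67,108,861-nt sequence (n = length + 2)
echo $(( 512 * (67108861 + 2) ))    # 34359737856 bytes, i.e. just under 32 GiB

which is consistent with the 34 GB or so observed on macOS.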

If you agree, I will close this issue in the coming days.

torognes commented 2 years ago

I mostly agree, but I think it is better to set a fixed maximum sequence length and give an informative error message when it is exceeded, rather than just crashing with a segmentation fault. We could set that limit as high as 67,108,861, or to a slightly lower but "round" number.

frederic-mahe commented 2 years ago

I have thought about it since yesterday, and I think you are right. This should be handled gracefully, not with a crash.

Hopefully, that length check can be done without any noticeable slowdown?

I am in favor of setting the limit as high as currently possible. Having a round number could be easier to explain and to remember, but we could also document the '2^26 - 3' limit and its relation to the creation of a table of size larger than 2^32, too large for our internal uint32 pointers. It shows users that the limit is not arbitrary, but rather linked to the computer's inner workings.

In the same vein, our documentation states that header lengths are limited to 2,048 characters, not 2,000. I think curious users can learn from that.

torognes commented 2 years ago

Swarm will now stop with a fatal error if a sequence is too long. It checks the length after reading each line from the input - that shouldn't affect performance. This is in commit fac4664e9435c77f59f9d8aee03a695efbd51c1c.
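
For reference, exercising the new check from the shell could look like this (a sketch only; it assumes the cap is at or below the 67,108,861-nt threshold discussed above, and that a fatal error yields a non-zero exit status other than 139, which is what bash reports for a segmentation fault):

# sketch: an over-long sequence should now trigger a clean fatal error, not a segfault
(
    printf ">s1_1\n"
    yes A | head -n 67108862 | tr -d "\n"
    printf "\n"
) | swarm --output /dev/null
status=$?
(( status != 0 && status != 139 )) && echo "clean fatal error (exit status ${status})"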

There is no specific limit imposed on header lengths anymore. But I think a 32-bit integer is used to keep the length, so it will probably not work with headers longer than 2^31 - 1 characters (2 GB - 1). Should be enough in most cases. :-)

frederic-mahe commented 2 years ago

> There is no specific limit imposed on header lengths anymore. But I think a 32-bit integer is used to keep the length, so it will probably not work with headers longer than 2^31 - 1 characters (2 GB - 1). Should be enough in most cases. :-)

Thanks, I will investigate the new max length for headers (perhaps with something like the sketch below). In the meantime, this issue can be closed.
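
A possible probe (a sketch; the 1,048,576-character header is an arbitrary choice):

# sketch: feed swarm a single entry with a very long header and check that it is accepted
HEADER=$(yes H | head -n 1048576 | tr -d "\n")
printf ">%s_1\nACGT\n" "${HEADER}" | swarm --output /dev/null
echo "exit status: $?"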