Closed. frederic-mahe closed this issue 2 years ago.
I've found the time to dig a little deeper:
export LC_ALL=C
# make fasta files (two entries) with very long sequences
for LENGTH in {67108861..67108862} ; do
    FASTA="tmp_${LENGTH}.fas"
    (
        printf ">s1_1\n"
        yes A | head -n "${LENGTH}" | tr -d "\n"
        printf "\n"
        printf ">s2_1\n"
        (printf "T\n" ; yes A) | head -n "${LENGTH}" | tr -d "\n"
        printf "\n"
    ) > "${FASTA}"
    swarm --output /dev/null "${FASTA}"
    rm "${FASTA}"
done
segmentation fault
That threshold is a bit odd. I was expecting something in line with the usual max values for integers. What do you think @torognes ?
Having several 67,108,861-nt fasta entries is ok (tested with 2, 3 and 4 entries). I will update the documentation to indicate that swarm can process fasta entries of up to 67 Mb.
If I can, I'll try to run the test in a debugger to check what triggers the segmentation fault.
Note: the length limit is very close to 2^16×1024 = 67,108,864
Yes, the limit is 2^26 - 3 = 64MB - 3, where M=1024*1024.
I can confirm that I get exactly the same limit on macOS. BTW it is eating up all memory (34GB or so)...
I'll also try to find out where it stops.
Maybe we should set a hard limit at a reasonable length anyway.
I think the problem is in the zobrist_init function in zobrist.cc. It is called from db_read in db.cc. This function allocates the arrays zobrist_tab_base, which is 32n bytes big, and zobrist_tab_byte_base, which is 512n bytes big. Here n is the length of the longest sequence + 2. Some of the calculations are done with unsigned ints (32 bits). I think the addressing of zobrist_tab_byte_base at the end of the function will fail when n is larger than 64M.
We could change the unsigned ints here to uint64_t to enable longer sequences. But I also think we should consider setting a shorter fixed limit on lengths, e.g. 1 Mbp or so, since the program will use too much memory and be so slow that it is not practical to use with longer sequences.
Thanks for searching what causes the issue. I did not realize the memory consumption was so high. In light of this, I suggest we do the following:
If you agree, I will close this issue in the coming days.
I mostly agree, but I think it is better to set a fixed maximum sequence limit and give an informative error message if it is broken, than just crashing with a segmentation fault. We could set that limit as high as 67,108,861 or to a slightly lower but "round" number.
I have thought about it since yesterday, and I think you are right. That should be handled gracefully, not with a crash.
Hopefully, that length check can be done without any noticeable slowdown?
I am in favor of setting the limit as high as currently possible. Having a round number could be easier to explain and remember, but we could also document the '2^26 - 3' limit and its relation to the creation of a table larger than 2^32, too large for our internal uint32 pointers. It shows users that the limit is not arbitrary, but rather linked to the computer's inner workings.
In the same vein, our documentation states that header lengths are limited to 2,048 characters, not 2,000. I think curious users can learn from that.
Swarm will now stop with a fatal error if a sequence is too long. It checks the length after reading each line from the input - that shouldn't affect performance. This is in commit fac4664e9435c77f59f9d8aee03a695efbd51c1c.
There is no specific limit imposed on header lengths anymore. But I think a 32-bit integer is used to keep the length, so it will probably not work with headers longer than 2^31 - 1 characters (2 GB - 1). Should be enough in most cases. :-)
Thanks, I will investigate the new max length for headers. In the meantime, that issue can be closed.
(low priority)
swarm can process input sequences of 10 megabytes, but returns a segmentation fault with 100-Mb input sequences.
I would like to investigate why, and check whether that limitation could be easily lifted. If not, that's perfectly ok, but I would like to give a clear maximal sequence length value in the manual.