ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

strobealign: Too many reference sequences. #421

Open lfdelzam opened 2 months ago

lfdelzam commented 2 months ago

Hi, I am using the latest version 0.13, and when mapping reads against co-assembled contigs, no sam file is created. This message appears in the log file:

This is strobealign 0.13.0 Estimated read length: 144 bp Time reading reference: 34.04 s Reference size: 15899.10 Mbp (17425146 contigs; largest: 1.20 Mbp) strobealign: Too many reference sequences. Current maximum is 16777215

It seems the indexing step is not working.

The same version worked when mapping the same samples against individually assembled contigs. Here is the log file: This is strobealign 0.13.0 Estimated read length: 145 bp Time reading reference: 2.20 s Reference size: 849.75 Mbp (996534 contigs; largest: 0.74 Mbp) Indexing ... Time counting seeds: 1.23 s Time generating seeds: 3.40 s Time sorting seeds: 4.05 s Time generating hash table index: 1.83 s Total time indexing: 10.52 s Running in paired-end mode Done! Total mapping sites tried: 114025071 Total calls to ssw: 16525791 Inconsistent NAM ends: 52038 Tried NAM rescue: 10875413 Mates rescued by alignment: 3786350 Total time mapping: 330.52 s. Total time reading read-file(s): 43.19 s. Total time creating strobemers: 40.19 s. Total time finding NAMs (non-rescue mode): 79.54 s. Total time finding NAMs (rescue mode): 1.79 s. Total time sorting NAMs (candidate sites): 1.94 s. Total time base level alignment (ssw): 104.23 s. Total time writing alignment to files: 0.00 s.

Thanks in advance for your time and help.

lfdelzam commented 2 months ago

by removing contigs < 500bs, the number of contigs is lower than 16777215 and it works. Why is 16777215 the limit? would it be possible to modify it?

marcelm commented 2 months ago

Hi, the limit is so because the indexing data structure that strobealign uses reserves 24 bits for the contig index. The maximum number of contigs is thus $2^{24}=16777216$. (I think the message should say 16777216 instead of 16777215.)

This is currently a hard limit that cannot be changed by command-line parameters.

We have been thinking about changing the way the index is stored in memory (for example in #227) and some of these changes could free up some bits that could then be used for the contig index, but we have not made any decisions.

ksahlin commented 2 months ago

Given the recent interest in aligning/mapping to large metagenomics datasets, I would consider even allocating a full 32 bits for the contig ID, and steal the 8-bits for the strobe offset from the 64bit hash value. Having the 8 bits offset as the lowest 8 bits in the hash value may even be a feature since we can look for/prefer strobemer matches with matching length if multiple matches (this is a bulletpoint in the development document).