torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
656 stars 122 forks source link

vsearch search_global time complexity #495

Closed jianshu93 closed 1 year ago

jianshu93 commented 2 years ago

Dear vsearch team,

I am curious what is the theoretical time complexity for search_global algorithm related to database sequence size (considering the same length for both database sequence and query sequence). Heuristics used in the command is based on kmer filtering right, similar to that of usearch search_global? It would be interesting to know the worst case or best case or average case et.al.

Thanks,

Jianshu

colinbrislawn commented 2 years ago

USEARCH has both usearch_global (k-mer indexing + alignment) and search_global (alignment only).

VSEARCH only includes usearch_global, but you can also configure it to do the same search:

If --maxaccepts and --maxrejects are both set to 0, the complete database is searched


Heuristics used in the command is based on kmer filtering right, similar to that of usearch search_global?

Nope, no heuristics and no filtering, so the full database is searched. Search time per-query increases linearly with database size. For this comprehensive and deterministic search, speed is always the same: slow 🐌

This is why settings like --maxaccepts 1024 --maxrejects 1024 should be almost as good and a lot faster! Here, k-mer indexing time increases with database size and alignment time increases with maxaccepts for similar reads and with maxrejects for divergent reads.

jianshu93 commented 2 years ago

Thanks for the response! This is helpful!

Thanks,

Jianshu