ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

Mapping accuracy: --aemb vs -x vs base-level alignment #430

Open jakobnissen opened 6 days ago

jakobnissen commented 6 days ago

Dear strobealign devs

I'm looking to replace minimap2+CoverM with strobealign for my metagenomic binning purposes. We're currently running benchmarks to see if strobealign gives as accurate results as minimap2, with and without --aemb. Results are not in yet. Previous experience has shown us that accurate mapping, in the sense of being able to accurately choose the correct reference to align to, is important for good abundance estimation for binning, not because it provides an abundance estimate which is actually closer to the real abundance, but because accurate mapping allows reads to discriminate between closely related strains. I'm writing because I want to make sure I understand what --aemb, -x and the default options of strobealign does, such that I can understand what our results mean. Is it correct that

ksahlin commented 5 days ago

Dear @jakobnissen

Answers below.

All three modes collects seeds / chains in the same manner

Yes

-x and --aemb only differ in their output format, they do the same mapping

For unambiguous mappings yes. For reads with N best locations --aemb will contribute read_len/N to the coverage computation of all references that share the best hit. -x will output one location at random, unless another parameter is set to output several best locations.

-x and --aemb uses some kind of heuristic to determine the best mapping, given the set of sync-strobemer seeds / anchors / chains (I'm not sure about the right terminology here)

Yes, we compute a NAM score. NAM stands for Non-overlapping Approximate Matches. Sort of like approximate anchors. NAM score increase with the number of seed-hits involved in forming the NAM, together with the length of the NAM, roughly. Since the matches are approximate, they can have different lengths between the read and the reference (e.g. indel). A NAM is penalised with the length difference between the read and the ref. Here is the computation: https://github.com/ksahlin/strobealign/blob/3a97f6b817824235a36ca8f9c5710ee533d3ad56/src/nam.cpp#L107 The comment will explain well.

Default parameters collect the same chains, but then does base-level alignment to determine the best mapping position more accurately, and should therefore be able to better distinguish between two similar, plausible mapping positions?

Correct. default mode (i.e., extension mode) is more accurate than mapping mode, the magnitude of improvement depends on the read length. The tradeoff is that extension mode is slower than -x (and much slower than --aemb which removes a lot of intermediate steps to compute coverage).

Happy to hear what you''ll find in your benchmarks.

Related, we are about to release a new seeding strategy which improves the mapping accuracy for shorter reads (about 100-125nt or shorter), particularly in the mapping-only mode (-x and --aemb). It is probably ready in a month.

FYI @luispedro @Itolstoganov