ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
144 stars 17 forks source link

Multiple mapping #431

Open yazhinia opened 2 months ago

yazhinia commented 2 months ago

Hello, Thank you developing this tool. I have a question related to the following point mentioned in the readme. The abundance is the number of bases mapped to a contig, divided by the length of the contig. Reads mapping to n different locations are weighted 1/n.

In my test run (even with -N option set to >1), I only see secondary alignments which need not be equally well mapped to reference like the primary alignment. How is multiple mapping defined here?

Also, is it possible to get MD flag field (like Bowtie2) to help compute sequence identity for the alignment?

Thank you.

ksahlin commented 2 months ago

Hi @yazhinia,

The abundance is the number of bases mapped to a contig, divided by the length of the contig. Reads mapping to n different locations are weighted 1/n.

The text above refers only when using the command-line option --aemb which outputs coverage estimates only. How are you running strobealign?

Also, is it possible to get MD flag field (like Bowtie2) to help compute sequence identity for the alignment?

Strobealign does not support an MD field, but with option --eqx it outputs CIGAR strings with =/X for matching and mismatching bases. Such CIGAR strings allow identity computation between the reads and the reference.

yazhinia commented 2 months ago

Thank you for the prompt reply. I wanted to understand how the aligner chooses n alignments for a read, if exist, in the --aemb mode? When you say Reads mapping to n different locations, does it mean mapping to multiple locations are equally perfect alignment? If I wanted to get n alignments that are perfect for every read, how to get it in output?

Is -N option in strobealign same as -k option in bowtie2?

Great, --eqx option is very useful.

ksahlin commented 2 months ago

I wanted to understand how the aligner chooses n alignments for a read, if exist, in the --aemb mode?

In --aemb mode it chooses based on seed hits and their scores, specifically the NAM score described in the strobealign paper. Aemb uses only the mapping location(s) with the best NAM score. If n identical best scoring locations it uses those n locations. No extension alignment is performed.

Is -N option in strobealign same as -k option in bowtie2?

Yes, it reports N alignment in decreasing order of score. The best score is the primary alignment.

If I wanted to get n alignments that are perfect for every read, how to get it in output?

You can set -N n --eqx (n your integer of choice) and parse cigar strings that only look like e.g. 150= if the read is of length 150. This will give you all "perfect" alignments, i.e., no difference to reference.

yazhinia commented 2 months ago

Thank you very much for the explanation.