Question: Identity threshold

stas-malavin commented 4 days ago

Hello, I seem to miss something in the algorithm, could you please specify how do you reject a read from being mapped without identity threshold? Or there is one implicitly? Thank you!

marcelm commented 4 days ago

There’s no explicit threshold. The implicit requirement is that there is at least one seed in the query that can also be found in the reference. Our seeds are randstrobes that consist of two $k$-mers, where $k$ depends on read length but is typically 20. Note that not all pairs of 20-mers are indexed, so two specific 20-mers need to match between query and reference.

I don’t think the read mappers I’m aware of implement an explicit identity threshold. If you wanted to have such a threshold, you’d need to add this as a postprocessing step (for example, by inspecting the NM:i tag in the SAM output).

stas-malavin commented 4 days ago

Hi Marcel, thanks for the quick reply. Again, I'm sorry for a stupid newbie question. I usually use bbmap which has a minid parameter, i.e., 'Approximate minimum alignment identity to look for', according to the manual, with the default value of 0.76.

I imagine a situation, where a metagenome contains two closely related species. If a read of, say, 150 bp covers half the conservative region and half the variable region (e.g. of the SSU gene or some protein-coding gene with a conservative domain), could it be so that both 20-mers hit the conservative part, and you got an ambiguous alignment?

That's why we typically calculate actual alignment when k-mers match, right? And as we calculate, why not implementing an id threshold?

marcelm commented 3 days ago

You get more than just one randstrobe per query. For a 150 bp read, for example, you get about 30 on average (they overlap each other). As long as one of them hits the unique part of the reference, the read will be mapped correctly, even when you run strobealign in mapping-only mode (-x), in which it will not compute alignments.

A more difficult situation would arise when most of the read is in the conserved region and only a small part of it covers a region that is variable. Then it could be that all randstrobes are in in a conserved region, making the mapping location ambiguous. In mapping-only mode, strobealign picks one of the possible locations at random, but in alignment mode, it computes alignments for the candidate locations and uses that to determine which location is the more likely one.

If I understand correctly, the minid parameter in bbmap is a way to trade speed for sensitivity. The documentation says this:

minid=0.76              Approximate minimum alignment identity to look for. 
                        Higher is faster and less sensitive.

So this appears to be intrinsic to the way their algorithm works: If you want the program to find alignments with lower identity (i.e., increase sensitivity), you can reduce this parameter, but it will be slower.

Strobealign’s algorithm does not need to make this tradeoff, so it doesn’t offer this particular knob.

Of course something like this could be done in a separate step where strobealign would filter out alignments that are below a certain percentage identity, but I don’t see what the advantage would be: Assume strobealign finds a match at 75% identity and our threshold is 76% – why should we throw that match away? (The point is that bbmap would possibly nt have found that match in the first place because it was tuned to find matches with at least 76% identity.)

stas-malavin commented 3 days ago

OK, I think I got the idea. Just to clarify—for the alignments, strobealign has some score threshold, right?

marcelm commented 3 days ago

No, there’s no threshold either, but there is soft clipping: If extending the alignment fully towards the 5' or 3' end would lead to a bad alignment score, the alignment is truncated. Together with the first requirement of at least two 20-mers matching means that I typically don’t see really bad alignments. There are sometimes quite short alignments (with many soft-clipped bases), but you could get rid of these by filtering by alignment score (for example with sth. like samtools view -e '[AS]>=100').

If during your testing of strobealign you find alignments that make no sense for your application and that you think strobealign should not have reported, please let us know.

stas-malavin commented 3 days ago

If I use an --aemb flag, I don't have sam to filter. Do short alignments show up in the resulting tsv?

I actually want to map metagenomic reads to just one genome, not to the whole spectrum of diversity from which the reads were obtained. In this case, will I get all the reads mapped (sounds odd)?

Next, I'm working with eukaryotes for which I have not enough coverage to assemble MAGs. I download WGS project for a species whose 18S is 3% different from what I found in the metagenome, and assemble it. Now I map the metagenomic reads onto this assembly. How do I tell strobealign to map just the reads that are, say, not less than 90% similar to the reference?

marcelm commented 2 days ago

If I use an --aemb flag, I don't have sam to filter. Do short alignments show up in the resulting tsv?

In the .tsv that you get with --aemb, all matches are counted, even short ones. (I’m avoiding the term "alignment" because no actual base-level alignments are computed.)

I actually want to map metagenomic reads to just one genome, not to the whole spectrum of diversity from which the reads were obtained. In this case, will I get all the reads mapped (sounds odd)?

Ah, I understand now. Yes, I believe most reads would be mapped.

Next, I'm working with eukaryotes for which I have not enough coverage to assemble MAGs. I download WGS project for a species whose 18S is 3% different from what I found in the metagenome, and assemble it. Now I map the metagenomic reads onto this assembly. How do I tell strobealign to map just the reads that are, say, not less than 90% similar to the reference?

Because no alignments are computed when you output abundances (--aemb) or PAF (-x), similarity is also not computed, so there’s no way to filter by similarity in these modes. You would need to use SAM output, then filter the way I suggested earlier, and then convert to PAF or an abundances .tsv.

That said, while the actual alignment is not available, strobealign knows the length of the match that it finds. When using -x, this is output in column 11 in the output file. When a query matches the reference only partially, this length is much less than the read length. So maybe one way to get something similar to what you want is to filter by this length. Currently, you would have to use PAF output and then, again, do this filtering manually, but if it is indeed helpful, this could be added to strobealign as an option for --aemb. I am unsure whether this is maybe too specialized, @ksahlin would need to decide.

stas-malavin commented 3 hours ago

Hi Marcel, thank you very much for the explanations and a helpful discussion! I'll try to update this thread with the results I get using the approach you suggested.

ksahlin / strobealign

Question: Identity threshold #462