tjcreedy / metamate

Your metabarcoding friend! Filter erroneous and unwanted amplicons.
GNU General Public License v3.0

Minimum overlap below 100% ASV length - implications for authenticity of ASVs #3

Closed by naurasd 2 years ago

naurasd commented 2 years ago

Hi @tjcreedy

Second issue from my side:

In your paper, it says:

To classify an ASV as a va-ASV requires 100% similarity across the overlapping region with a reference sequence. The required minimum overlap can be adjusted in metaMATE, with a default value of 80% of the ASV sequence length. A minimum overlap smaller than the length of ASVs implies that two or more ASVs can match with 100% similarity with the same reference, if these ASVs differ only in the nonoverlapping region. In such cases, we recommend that none of the ASVs should be considered as va-ASVs.
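The situation described in the quote can be illustrated with a toy example. This is only a sketch, not metaMATE's matching code: the function `matches_reference`, its `min_overlap_pct` parameter, and the sequences are hypothetical, and it assumes an ungapped alignment in which the ASV and the reference start at the same position.

```python
def matches_reference(asv, ref, min_overlap_pct=80):
    """Toy va-ASV check (hypothetical, not metaMATE's implementation).

    Assumes the ASV and the reference are ungapped and start at the same
    position, so the overlapping region is simply the shared prefix.
    """
    overlap = min(len(asv), len(ref))
    # The overlap must cover at least min_overlap_pct percent of the ASV length.
    if overlap * 100 < min_overlap_pct * len(asv):
        return False
    # 100% similarity is required across the overlapping region only.
    return asv[:overlap] == ref[:overlap]


ref = "ACGTACGT"        # 8 bp reference, covering 80% of each 10 bp ASV
asv_a = "ACGTACGTAA"    # differs from asv_b only in the last 2 bp,
asv_b = "ACGTACGTCC"    # which lie outside the overlapping region

both_match = matches_reference(asv_a, ref) and matches_reference(asv_b, ref)
full_length_match = matches_reference(asv_a, ref, min_overlap_pct=100)
```

With an 80% minimum overlap both ASVs match the same reference, whereas with a 100% minimum overlap neither does, because the reference is shorter than the ASVs.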

I was wondering whether this is appropriate, given that COI reference databases are usually incomplete. This implies that we should not require the overlap with a reference sequence to span the entire ASV length (= 100% minimum overlap), because then only ASVs with an identical full-length sequence in the reference set could be identified as verified authentic. So, if I understand correctly, we should choose a minimum overlap below the ASV length. However, you suggest that ASVs passing the 100% similarity threshold should not be considered verified authentic when the minimum overlap is below 100% of the ASV length.

What are your thoughts on this? Did I misunderstand your explanation in the paper?

Thanks so much,

Nauras

tjcreedy commented 2 years ago

Hi Nauras,

You're completely right about the variability in database coverage. That is why we provide options to customise the va-ASV specification to suit the data available for your target taxon and COI region, although relaxing these settings will give uncertain matches. In general, we have no other way to assess whether uncertain matches are accurate. However, in the specific edge case where more than one ASV matches the same reference, we might suspect that the match is not appropriate, because we do not expect more than one correct ASV for a given genotype in a given COI region.

If I read your message correctly, I suspect this is where there is a slight misunderstanding. The last sentence of the quote from our paper ("In such cases, we recommend that none of the ASVs should be considered as va-ASVs.") refers specifically to this edge case, where multiple ASVs match the same reference under similarity / overlap settings of less than 100%. It does not refer to all matches made with these settings. However, this is only a conceptual recommendation at this stage, and it could be argued the other way. As it stands, metaMATE does not actually implement this caveat: all matches under the given settings will be considered va-ASVs.
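Since metaMATE does not apply this caveat itself, the edge case could be checked after the fact. The following is a minimal sketch under stated assumptions: it presumes you have a list of (ASV ID, reference ID) pairs for the matches that passed your settings, obtained however suits your workflow. The function `split_ambiguous_matches` and the IDs are hypothetical and not part of metaMATE.

```python
from collections import defaultdict


def split_ambiguous_matches(hits):
    """Separate ASVs that share a matched reference with another ASV.

    hits: list of (asv_id, reference_id) pairs that passed the chosen
    similarity / overlap settings (hypothetical input format).
    Returns (unambiguous_asvs, ambiguous_asvs) as two sets.
    """
    asvs_by_ref = defaultdict(set)
    for asv_id, ref_id in hits:
        asvs_by_ref[ref_id].add(asv_id)

    # Any ASV matching a reference that is also matched by another ASV
    # is treated as ambiguous, even if it has other, unique matches.
    ambiguous = set()
    for asvs in asvs_by_ref.values():
        if len(asvs) > 1:
            ambiguous |= asvs

    all_asvs = {asv_id for asv_id, _ in hits}
    return all_asvs - ambiguous, ambiguous


hits = [("asv1", "refA"), ("asv2", "refA"), ("asv3", "refB")]
unambiguous, ambiguous = split_ambiguous_matches(hits)
```

Here `asv1` and `asv2` both match `refA`, so they end up in `ambiguous`, while `asv3` remains in `unambiguous`. Whether to exclude the ambiguous ASVs from the va-ASV set is the judgment call discussed above.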

Hope that helps, Thomas

naurasd commented 2 years ago

Hi Thomas,

Thanks for the clarification. I will see how my ASVs behave with metaMATE and whether several ASVs actually match the same reference. I will close this issue for now and might reopen it if something comes up.

Thanks so much, Nauras