nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement
https://clades.nextstrain.org
MIT License

Is it by design that sequences with lots of Ns are classified as recombinant? #918

Closed by danrlu 2 years ago

danrlu commented 2 years ago

This sequence is mostly Ns with 1 mutation, yet it is classified as "recombinant". Just checking that this is intended behavior. It is certainly a little confusing.

[image]

corneliusroemer commented 2 years ago

Hi Dan!

A sequence with 28816 Ns is very much an edge case. How many non-N bases do you even have? It can't be much more than a thousand!

In that regime, it's basically impossible to place sequences phylogenetically with any confidence: there is too little information.
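To put a rough number on this, here is a minimal sketch that counts the informative (non-N) bases. The genome length of 29,903 nt is an assumption (the SARS-CoV-2 Wuhan-Hu-1 reference); the thread does not name the reference, and `count_informative` is a hypothetical helper, not part of Nextclade.

```python
def count_informative(seq: str) -> int:
    """Count bases that are neither N nor a gap character (case-insensitive)."""
    return sum(1 for base in seq.upper() if base not in {"N", "-"})


# Assumed reference length (SARS-CoV-2 Wuhan-Hu-1); not stated in the thread.
GENOME_LENGTH = 29_903
n_count = 28_816  # number of Ns reported for the sequence in question

# Upper bound on informative positions if the sequence spans the full genome.
max_informative = GENOME_LENGTH - n_count
coverage = max_informative / GENOME_LENGTH
print(max_informative, f"{coverage:.1%}")
```

Under that assumption the sequence carries at most about 1,100 informative positions, which is under 4% of the genome.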

The reason this ends up as a recombinant is that recombinants have stretches from two lineages, so in order to say whether something is a recombinant or not you need non-N bases on both sides of the breakpoint. Otherwise it's impossible to be sure and one needs to use some sort of prior (which in this case would be very much against this sequence being a recombinant).

But we don't do any prior weighting: we just traverse the tree and choose the best match. Maybe we could bias against assigning a recombinant if there's a non-recombinant that has the same parsimony score.
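The effect can be illustrated with a toy sketch. This is not Nextclade's placement code; it assumes a simplified model in which each candidate is a full sequence and positions where the query is N are skipped. All names and sequences are made up.

```python
def masked_distance(query: str, node: str) -> int:
    """Count mismatches, ignoring positions where the query is N."""
    return sum(1 for q, n in zip(query, node) if q != "N" and q != n)


# Toy 10-base candidates; hypothetical names and sequences.
nodes = {
    "lineage_A": "AAAAAAAAAA",
    "lineage_B": "CCCCCCCCCC",
    "recombinant_AB": "AAAAACCCCC",  # A left of the breakpoint, B right of it
}

query = "NNANNNNNNN"  # a single informative base, left of the breakpoint

scores = {name: masked_distance(query, seq) for name, seq in nodes.items()}
best = min(scores.values())
ties = sorted(name for name, score in scores.items() if score == best)
print(scores)
print(ties)
```

With only one informative base, `lineage_A` and `recombinant_AB` both score 0, so the data cannot separate them. A query with bases on both sides of the breakpoint, such as `"AAANNNNAAA"`, would score 0 against `lineage_A` and 3 against `recombinant_AB`.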

Otherwise, you could run it through the dataset that has no recombinants:

[image]

I hope that helps. There are two potential improvements we can make:

  1. Improve the lineage we assign when there are equally parsimonious placements, e.g. bias against recombinants
  2. Add a note of the number of equally parsimonious placements to show uncertainty
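Both ideas could be sketched roughly as follows. This is only an illustration of the proposal, not an existing Nextclade feature; `Placement`, `choose_placement`, and the lineage names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    lineage: str
    parsimony_score: int
    is_recombinant: bool


def choose_placement(candidates: list[Placement]) -> tuple[Placement, int]:
    """Return the chosen placement and the number of equally parsimonious ones."""
    if not candidates:
        raise ValueError("no candidate placements")
    best_score = min(c.parsimony_score for c in candidates)
    tied = [c for c in candidates if c.parsimony_score == best_score]
    # Among ties, prefer non-recombinants (False sorts before True);
    # min() keeps the first candidate when the key is equal.
    chosen = min(tied, key=lambda c: c.is_recombinant)
    return chosen, len(tied)


candidates = [
    Placement("recombinant_X", 0, True),
    Placement("lineage_Y", 0, False),
    Placement("lineage_Z", 2, False),
]
chosen, n_tied = choose_placement(candidates)
print(chosen.lineage, n_tied)
```

Here the two best candidates tie at score 0, the non-recombinant `lineage_Y` is chosen, and the tie count of 2 is returned so it can be shown to the user as a sign of uncertainty.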

One question for you: did you notice the high number of Ns? The sequence is colored red so one shouldn't really trust the lineage assignment. Should that be made clearer?

danrlu commented 2 years ago

This all makes sense. Very interesting to think through!

This came up as we were QC-ing the sequence. Usually we would discard anything with so many Ns, but seeing it as recombinant made us think twice and wonder whether we would miss anything interesting by throwing it away... But as you said, the lineage assignment is not meaningful in this case. I remember there are other reasons a sequence would be red, right? Like too many mixed sites. I don't automatically associate red sequences with not trusting the lineage. I think for sequences with too many Ns, it's best to not show the lineage at all.
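The suggestion of hiding the lineage could look something like the sketch below. The cutoff and the function name are hypothetical and chosen only for illustration; they are not Nextclade defaults.

```python
from typing import Optional

# Hypothetical cutoff for illustration; not a Nextclade default.
MAX_N_FRACTION = 0.5


def reported_lineage(
    lineage: str, seq: str, max_n_fraction: float = MAX_N_FRACTION
) -> Optional[str]:
    """Return the lineage, or None when the sequence has too many Ns to trust it."""
    if not seq:
        return None
    n_fraction = seq.upper().count("N") / len(seq)
    return lineage if n_fraction <= max_n_fraction else None


print(reported_lineage("lineage_Y", "ACGTACGTNN"))  # 20% Ns
print(reported_lineage("recombinant_X", "NNNNNNNNNA"))  # 90% Ns
```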

I think in general it could be helpful to show confidence in the lineage assignment one way or another, but then the question becomes whether that will be additionally confusing, and that's a bigger question than the issue at hand.

Thanks very much for the great tool and for always improving it!

corneliusroemer commented 2 years ago

Thanks for the feedback, always good to hear how users interpret things! Yeah, showing the number of equally parsimonious lineages would have flagged this very clearly here.