nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement
https://clades.nextstrain.org
MIT License

Documentation for private mutations #202

Closed garfinjm closed 3 years ago

garfinjm commented 4 years ago

Hello,

I'm interested to know a little more about how the private mutation QC check is defined. I pretty regularly have samples that fail that check but otherwise look OK.

Thanks again for making such a great tool! Jake

ivan-aksamentov commented 4 years ago

Hi Jake @garfinjm,

Thanks for your kind words.

How QC works in general: we have a set of rules (4 rules currently) and each rule outputs a numeric score. Then the decision is made based on this score.

If you prefer to see the code, this is how the score is calculated in the "P" rule:

https://github.com/nextstrain/nextclade/blob/3621fdd0aaaa2f023f68791e19fba9fd36b6c85d/packages/web/src/algorithms/QC/rulePrivateMutations.ts#L15-L20

Here, privateMutations contains the mutations that are present in the sequence but not in the closest* reference tree node (what we call "private" mutations).
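As a rough illustration of that definition (not the actual Nextclade code), private mutations can be thought of as a set difference. The sketch below assumes mutations are represented as plain strings such as "C241T"; the function name `findPrivateMutations` and the example mutation lists are hypothetical, and details such as reversions are ignored.

```typescript
// Simplified sketch: each mutation is a plain string such as "C241T".
// The representation and the function name are assumptions for illustration,
// not taken from the Nextclade source.
function findPrivateMutations(sequenceMutations: string[], nodeMutations: string[]): string[] {
  const inNode = new Set(nodeMutations)
  // Keep only the mutations that the closest tree node does not carry.
  return sequenceMutations.filter((mutation) => !inNode.has(mutation))
}

// Hypothetical example data.
const exampleSequenceMutations = ['C241T', 'C3037T', 'A23403G', 'G28881A']
const exampleNodeMutations = ['C241T', 'C3037T', 'A23403G']

// Only G28881A is absent from the closest node, so it is the single private mutation.
console.log(findPrivateMutations(exampleSequenceMutations, exampleNodeMutations))
```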

--

And then this is how the verdict is made based on score: https://github.com/nextstrain/nextclade/blob/3621fdd0aaaa2f023f68791e19fba9fd36b6c85d/packages/web/src/algorithms/QC/QCRuleStatus.ts#L7-L14

Basically, below 30 is good, above 99 is bad and what's in the middle is mediocre.
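In code, that verdict could look roughly like the sketch below. The type and function names (`QCStatus`, `getQCStatus`) are made up for this example, and the exact handling of the boundaries (strictly below 30, strictly above 99) is an assumption read off the sentence above rather than copied from the linked file.

```typescript
// Hypothetical names; thresholds follow "below 30 is good, above 99 is bad".
type QCStatus = 'good' | 'mediocre' | 'bad'

function getQCStatus(score: number): QCStatus {
  if (score < 30) return 'good' // assumed: strictly below 30
  if (score > 99) return 'bad' // assumed: strictly above 99
  return 'mediocre' // everything in between
}

console.log(getQCStatus(12)) // good
console.log(getQCStatus(55)) // mediocre
console.log(getQCStatus(140)) // bad
```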

You can set the typical and cutoff parameters in the settings (click the "Settings" button on the table view page). Note that these settings persist across runs and page reloads. If you want to reset them to the defaults, there is a button for that.
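To give a feel for how typical and cutoff might enter the score, here is a minimal sketch. It assumes the score grows linearly with the number of private mutations in excess of typical and reaches 100 when that excess equals cutoff. The formula, the names, and the parameter values are all illustrative assumptions, not the defaults; see the linked rulePrivateMutations.ts for the real calculation.

```typescript
// Hypothetical config shape and function name, for illustration only.
interface PrivateMutationsConfig {
  typical: number // number of private mutations considered unremarkable
  cutoff: number // excess over `typical` at which the score reaches 100
}

function privateMutationsScore(numPrivateMutations: number, config: PrivateMutationsConfig): number {
  // Assumed formula: only mutations beyond `typical` contribute to the score.
  const excess = Math.max(0, numPrivateMutations - config.typical)
  return (excess * 100) / config.cutoff
}

// Made-up parameter values, not Nextclade's defaults.
const exampleConfig: PrivateMutationsConfig = { typical: 5, cutoff: 15 }

console.log(privateMutationsScore(3, exampleConfig)) // 0
console.log(privateMutationsScore(8, exampleConfig)) // 20
console.log(privateMutationsScore(20, exampleConfig)) // 100
```

Under these assumptions, raising cutoff makes the rule more lenient, while raising typical shifts the point at which private mutations start to count at all.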

Let me know if this answers your question. Otherwise we might ask Richard to elaborate on the scientific part.

Do you think this is a reasonable definition for the rule? Would you do it differently? Let's discuss!

rneher commented 4 years ago

Hi @garfinjm,

The private mutations score is essentially the number of mutations by which a sequence differs from the closest sequence in our reference tree. You are right that our threshold is pretty stringent (10 mutations). But given the dense coverage of SARS-CoV-2 diversity, our experience at Nextstrain is that most sequences have close matches, and a strain with 10 mutations that we haven't seen before is something we look at more closely. In less well-sampled lineages, however, 10 private mutations can certainly happen, and the flag for private mutations doesn't necessarily mean that the sequence is bad.

We could ignore mutations at the very 5' or 3' ends because they are masked in many downstream analyses anyway...

garfinjm commented 3 years ago

Thank you @ivan-aksamentov and @rneher for the explanation! I don't think the defaults need to be modified at all. A lot of the sequencing in my lab is from outbreaks in congregate settings, where if the introductory case has a lot of private mutations, (almost) all of the samples have a lot of private mutations. At first glance this always looks like a lot of red/yellow in Nextclade, but it makes sense as long as they share the same mutations.
