jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License

Max mutation parents outlier in Viridian data #228

Open jeromekelleher opened 1 month ago

jeromekelleher commented 1 month ago

As of 2021-02-26 we have a max_mutation_parents value of 129, which is clearly pathological. Investigate.

jeromekelleher commented 1 month ago

max_mutations_per_site is also much higher at 2060 vs 890.
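
For reference, a per-site mutation count of this kind could be sketched as below. This is only an illustration, not the sc2ts code: `mutations_per_site` is a hypothetical helper, `site` is assumed to be a sequence holding one site ID per mutation (as in a mutation table's site column), and the toy data is made up.

```python
from collections import Counter


def mutations_per_site(site):
    """Count mutations at each site; `site` has one site ID per mutation."""
    return Counter(site)


# Toy data (made up): five mutations spread over three sites.
counts = mutations_per_site([0, 0, 0, 1, 2])
# most_common(1) returns the (site, count) pair with the highest count.
worst_site, worst_count = counts.most_common(1)[0]
```

The maximum over this counter would correspond to a statistic like max_mutations_per_site, under the assumptions above.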

jeromekelleher commented 1 month ago

This site is a likely cause:

21846

szhan commented 1 month ago

The mutation C>T at position 21846 lies within amplicon 72 (ARTIC v3), which suffers from dropout in Delta samples, and so the position may be affected by sequencing artifacts (see this paper).

szhan commented 1 month ago

Do the reversions tend to happen within Delta (B.1.617.2) lineages?

jeromekelleher commented 1 month ago

Confirming that this site is the one with a chain of 128 successive mutations. The mutations with > 2 parents are flip-flopping between C and T.
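
A rough sketch of how a chain length like this might be measured, assuming the statistic counts how many ancestors a mutation has when following parent pointers at the same site. The helper names are hypothetical, `-1` is assumed to mean "no parent", and parents are assumed to appear before their children in the table (the ordering tskit requires for mutation tables). The toy arrays are made up.

```python
NULL = -1  # assumed marker for "no parent mutation"


def mutation_parent_depths(parent):
    """Number of ancestors of each mutation along its parent chain.

    Assumes every parent has a smaller index than its child, so one
    forward pass suffices.
    """
    depths = [0] * len(parent)
    for j, p in enumerate(parent):
        if p != NULL:
            depths[j] = depths[p] + 1
    return depths


def max_parents_per_site(site, parent):
    """Longest parent chain at each site, keyed by site ID."""
    best = {}
    for s, d in zip(site, mutation_parent_depths(parent)):
        best[s] = max(best.get(s, 0), d)
    return best


# Toy data (made up): site 0 has a chain 0 -> 1 -> 2, site 1 has 3 -> 4.
site = [0, 0, 0, 1, 1]
parent = [NULL, 0, 1, NULL, 3]
per_site = max_parents_per_site(site, parent)
worst = max(per_site, key=per_site.get)
```

Sorting `per_site` by value would then point at the sites worth inspecting first.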

jeromekelleher commented 1 month ago

28271 is the next highest mutation count, and is also likely problematic:

28271

jeromekelleher commented 1 month ago

Note that 28271 showed similar problems in the GISAID data, so it seems likely to be problematic and is a good candidate for exclusion.

jeromekelleher commented 1 month ago

The site with the next-highest mutation count is 27638. This looks different, with consistent flicking back and forth between T and C:

27638
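
One way to quantify this kind of flicking is to count reversions along the parent chain at a single site. The sketch below is an assumption about how that could be done, not something taken from sc2ts: `count_reversions` is a hypothetical helper, a "reversion" is taken to mean a mutation whose derived state equals the state its parent mutation replaced, and `-1` is assumed to mean "no parent". The toy data is made up.

```python
NULL = -1  # assumed marker for "no parent mutation"


def count_reversions(ancestral_state, derived_state, parent):
    """Count mutations at one site that undo their parent mutation.

    `derived_state[j]` and `parent[j]` describe mutation j at this site.
    """
    reversions = 0
    for j, p in enumerate(parent):
        if p == NULL:
            continue  # a root mutation has nothing to revert
        # State that was present before the parent mutation occurred.
        if parent[p] == NULL:
            before_parent = ancestral_state
        else:
            before_parent = derived_state[parent[p]]
        if derived_state[j] == before_parent:
            reversions += 1
    return reversions


# Toy chain (made up): C > T > C > T > C, each mutation the child of the last.
flip_flop = count_reversions("C", ["T", "C", "T", "C"], [NULL, 0, 1, 2])
# Toy chain (made up) with no reversions: C > T > G > A.
no_flip = count_reversions("C", ["T", "G", "A"], [NULL, 0, 1])
```

A site where nearly every non-root mutation is a reversion would match the back-and-forth pattern described here.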

jeromekelleher commented 1 month ago

27752 seems quite similar:

27752