Open jeromekelleher opened 1 year ago
Notice the mutations with 1 and 2 in the inheritors column here - why do they have very few inheritors? In this case they all correspond to "immediate" reversions, but more generally it may be useful to look at these as a way of quantifying how many of our reversions are artefactual.
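For reference, the "immediate" reversion check described here can be sketched roughly as below. This is a minimal illustration only, assuming a toy representation: the `Mutation` tuple, the `parent_of` map, and the `mutations_by_node_site` lookup are hypothetical names, not part of sc2ts or tskit.

```python
from typing import NamedTuple


class Mutation(NamedTuple):
    # Hypothetical minimal mutation record for illustration.
    site: int
    node: str
    inherited_state: str
    derived_state: str


def is_immediate_reversion(mut, parent_of, mutations_by_node_site):
    """Return True if `mut` undoes a mutation at the same site on the parent node."""
    parent_node = parent_of.get(mut.node)
    parent_mut = mutations_by_node_site.get((parent_node, mut.site))
    # An immediate reversion restores the state the parent's mutation changed from.
    return parent_mut is not None and mut.derived_state == parent_mut.inherited_state


# Toy example: node "b" is a child of node "a".
parent_of = {"b": "a"}
on_parent = Mutation(site=10, node="a", inherited_state="C", derived_state="A")
reverting = Mutation(site=10, node="b", inherited_state="A", derived_state="C")
not_reverting = Mutation(site=10, node="b", inherited_state="A", derived_state="G")
mutations_by_node_site = {("a", 10): on_parent}

immediate = is_immediate_reversion(reverting, parent_of, mutations_by_node_site)
other = is_immediate_reversion(not_reverting, parent_of, mutations_by_node_site)
```

Reversions further up the path (not on the direct parent) are not caught by this check, which is why the inheritor count is suggested as an additional signal.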
Yep, that's definitely it:
We should add some diagnostic plots to the TreeInfo for this. I think this histogram is useful, and a scatter plot of path distance to the parent vs number of inheritors would also be really useful.
I'm slow on the uptake here. What is an "inheritor" vs a descendant?
An inheritor is a sample that directly inherits that specific mutation, and a descendant is any sample descending from the mutation's node at that site.
> An inheritor is a sample that directly inherits that specific mutation,
Do you mean "without subsequent changes" - i.e. if a mutation reverts, does that reduce the number of inheritors? Otherwise I can't see why they would differ. Perhaps I'm being slow again, though.
I mean literally inheriting that specific mutation (object?), not the particular state change. So, mutation 10A>G has 2 inheritors and 2 descendants above, whereas mutation 10C>A has 3 descendants and 1 inheritor.
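To make the distinction concrete, here is a rough sketch on a toy tree. It assumes a plain child-to-parent dictionary rather than a real tree sequence, and the node names and layout are made up to reproduce the counts in the example above (10C>A sits above 10A>G at the same site).

```python
from collections import defaultdict

# Hypothetical toy tree as child -> parent.
parent = {"n1": "root", "s1": "n1", "n2": "n1", "s2": "n2", "s3": "n2"}
samples = {"s1", "s2", "s3"}
# Mutations at site 10, keyed by the node they occur on.
mutations_at_site = {"n1": "10C>A", "n2": "10A>G"}


def count_descendants_and_inheritors(parent, samples, mutations_at_site):
    descendants = defaultdict(int)
    inheritors = defaultdict(int)
    for sample in samples:
        node = sample
        nearest_seen = False
        # Walk from the sample up to the root.
        while node is not None:
            if node in mutations_at_site:
                # Every mutation on the path has this sample as a descendant.
                descendants[node] += 1
                # Only the first mutation met on the way up is the one
                # the sample actually inherits at this site.
                if not nearest_seen:
                    inheritors[node] += 1
                    nearest_seen = True
            node = parent.get(node)
    return {n: (descendants[n], inheritors[n]) for n in mutations_at_site}


counts = count_descendants_and_inheritors(parent, samples, mutations_at_site)
# counts maps node -> (number of descendants, number of inheritors)
```

In this toy layout, 10C>A on `n1` gets 3 descendants but only 1 inheritor (`s1`), because `s2` and `s3` inherit 10A>G on `n2` instead.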
Just want to note here that multiple folks (I recall specifically Katrina Lythgoe) have mentioned that some genomes with poorly covered bases are imputed using the reference sequence (Wuhan-Hu-1). This problem may be encountered more often among samples sequenced early on. I don't think those entries are replaced later on. We should figure out how to flag such samples and exclude them during initial ARG building. Perhaps we could exclude such imputed samples after an HMM run, if it is determined that many reversions are needed to explain them.
> multiple folks... have mentioned that some genomes with poorly covered bases are imputed using the reference sequence

This point was brought up by Kelly Harris and Luca Ferrati too. We could look at how many non-reference reversions there are, over time or possibly by region (Luca thought that UK samples, for example, should be less likely to be imputed to the reference).
A quick way of finding some of these sequences may be to look at the sc2ts sample QC metadata. Presumably if they've been pre-imputed, the number of masked sites will be small. We can then look at these samples to see if they have excess reversions.
We should open a new issue to track this, I think?
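Something along these lines might work as a first pass. This is only a sketch: the record fields (`sample_id`, `masked_sites`, `mutations_to_reference`) and both thresholds are placeholders I've made up for illustration, not the actual sc2ts metadata layout.

```python
def flag_possibly_imputed(records, max_masked_sites=10, min_ref_mutations=3):
    """Flag samples with few masked sites and many mutations to the reference state.

    Field names and default thresholds are hypothetical and would need
    tuning against real data.
    """
    flagged = []
    for rec in records:
        few_masked = rec["masked_sites"] <= max_masked_sites
        many_to_ref = rec["mutations_to_reference"] >= min_ref_mutations
        if few_masked and many_to_ref:
            flagged.append(rec["sample_id"])
    return flagged


# Made-up records, purely to show the shape of the input.
records = [
    {"sample_id": "s1", "masked_sites": 250, "mutations_to_reference": 4},
    {"sample_id": "s2", "masked_sites": 0, "mutations_to_reference": 5},
    {"sample_id": "s3", "masked_sites": 2, "mutations_to_reference": 0},
]
flagged = flag_possibly_imputed(records)
```

Only the sample with both few masked sites and several mutations back to the reference is flagged; a low masked-site count alone is not treated as evidence of imputation.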
Or samples that got lots of mutations going to a reference state? Not necessarily only reversions going back to a non-reference state. Also, I suspect that the imputed bases are concentrated in certain regions reported to suffer from amplicon dropouts. As I recall, the ARTIC V1 primer set, which was used to sequence lots of early samples, had quite a few documented dropouts (not just in Spike).
Yes, any mutations back to the reference state.
We have lots of reversions, but beyond "immediate" reversions (those reverting a mutation on the parent node), it's not clear how to identify them. One clear way might be to look at mutations that have very few "inheritors", e.g.: