Classification for non-biological reversions

jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data

MIT License


Classification for non-biological reversions #100

Open jeromekelleher opened 1 year ago

jeromekelleher commented 1 year ago

We have lots of reversions, but beyond "immediate" reversions (those reverting a mutation on the parent node), it's not clear how to identify them. One clear way might be to look at mutations that have very few "inheritors", e.g.:

[Image: almost-no-inheritors]

jeromekelleher commented 1 year ago

Notice the mutations with 1 and 2 in the inheritors column here - why do they have very few inheritors? In this case they all correspond to "immediate" reversions, but more generally it may be useful to look at these as a way of quantifying how many of our reversions are artefactual.
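As a rough illustration of the "immediate" reversion definition above (a mutation that reverts a mutation on the parent node), here is a minimal sketch. The `parent` mapping and the `(node, site, inherited_state, derived_state)` tuples are hypothetical stand-ins for whatever the tree sequence provides, not the sc2ts API.

```python
def immediate_reversions(parent, mutations):
    """Return (node, site) pairs whose mutation undoes one on the parent node.

    parent:    dict mapping child node -> parent node (hypothetical layout)
    mutations: list of (node, site, inherited_state, derived_state)
    """
    by_node_site = {(node, site): (inh, der) for node, site, inh, der in mutations}
    found = []
    for node, site, inh, der in mutations:
        # Look for a mutation at the same site on the parent node.
        above = by_node_site.get((parent.get(node), site))
        # Immediate reversion: parent has X>Y and this node has Y>X.
        if above is not None and above == (der, inh):
            found.append((node, site))
    return found


# Toy input: node 2 reverts the C>T that its parent (node 1) introduced at site 100.
toy_parent = {1: 0, 2: 1, 3: 1}
toy_mutations = [(1, 100, "C", "T"), (2, 100, "T", "C"), (3, 200, "A", "G")]
result = immediate_reversions(toy_parent, toy_mutations)
```

Reversions of mutations further up the tree are not caught by this check, which is the harder case discussed here.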

jeromekelleher commented 1 year ago

Yep, that's definitely it:

[Image: inheritors-for-reversions]

jeromekelleher commented 1 year ago

We should add some diagnostic plots to the TreeInfo for this, I think. This histogram is useful, and a scatter plot of path-distance to the parent vs number of inheritors would also be really useful.

hyanwong commented 1 year ago

I'm slow on the uptake here. What is an "inheritor" vs a descendant?

jeromekelleher commented 1 year ago

An inheritor is a sample that directly inherits that specific mutation, and a descendant is any sample descending from the mutation's node at that site.

hyanwong commented 1 year ago

> An inheritor is a sample that directly inherits that specific mutation,

Do you mean "without subsequent changes" - i.e. if a mutation reverts, does that reduce the number of inheritors? Otherwise I can't see why they would differ. Perhaps I'm being slow again, though.

jeromekelleher commented 1 year ago

I mean literally inheriting that specific mutation (object?), not the particular state change. So, mutation 10A>G has 2 inheritors and 2 descendants above, whereas mutation 10C>A has 3 descendants and 1 inheritor.
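To make the distinction concrete, here is a small sketch on a toy tree. The tree shape and node names are assumptions chosen only so that the counts match the ones quoted above: 10C>A sits on a node with three samples below it, and 10A>G sits on a child node above two of those samples. A sample is counted as an inheritor of the first mutation it meets on its path to the root, and as a descendant of every mutation on that path.

```python
# Hypothetical toy tree: child -> parent. Samples are the leaves s1..s3.
PARENT = {"x": "root", "y": "x", "s1": "x", "s2": "y", "s3": "y"}
SAMPLES = ["s1", "s2", "s3"]
# Mutations at one site (position 10), keyed by the node they sit above.
MUTATIONS = {"x": "10C>A", "y": "10A>G"}


def path_to_root(node):
    """Yield the node itself, then each ancestor up to the root."""
    while node is not None:
        yield node
        node = PARENT.get(node)


def count_descendants_and_inheritors():
    descendants = {m: 0 for m in MUTATIONS.values()}
    inheritors = {m: 0 for m in MUTATIONS.values()}
    for sample in SAMPLES:
        nearest_seen = False
        for node in path_to_root(sample):
            mut = MUTATIONS.get(node)
            if mut is None:
                continue
            # Every mutation above the sample counts it as a descendant.
            descendants[mut] += 1
            # Only the nearest mutation above the sample counts it as an inheritor.
            if not nearest_seen:
                inheritors[mut] += 1
                nearest_seen = True
    return descendants, inheritors


descendants, inheritors = count_descendants_and_inheritors()
```

On this toy tree, 10C>A ends up with 3 descendants and 1 inheritor, and 10A>G with 2 of each.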

szhan commented 1 year ago

Just want to note here that multiple folks (I recall specifically Katrina Lythgoe) have mentioned that some genomes with poorly covered bases are imputed using the reference sequence (Wuhan-Hu-1). This problem may be encountered more often among samples sequenced early on. I don't think those entries are replaced later on. We should figure out how to flag such samples and exclude them during initial ARG building. Perhaps we could exclude such imputed samples after an HMM run, if it is determined that many reversions are needed to explain them.

hyanwong commented 1 year ago

> multiple folks... have mentioned that some genomes with poorly covered bases are imputed using the reference sequence

This point was brought up by Kelly Harris and Luca Ferrati too. We could look at how many non-reference reversions there are, over time or possibly by region (Luca thought that UK samples, for example, should be less likely to be imputed to the reference).

jeromekelleher commented 1 year ago

A quick way of finding some of these sequences may be to look at the sc2ts sample QC metadata. Presumably if they've been pre-imputed, the number of masked sites will be small. We can then look at these samples to see if they have excess reversions.

We should open a new issue to track this, I think?
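A sketch of that check, assuming each sample is available as a record with a masked-site count and a reversion count. The field names `masked_sites` and `reversions`, the threshold, and the records themselves are all made up for illustration; the real sc2ts QC metadata keys may differ.

```python
def split_by_masking(samples, max_masked=5):
    """Split samples into (suspect, rest) by number of masked sites."""
    suspect = [s for s in samples if s["masked_sites"] <= max_masked]
    rest = [s for s in samples if s["masked_sites"] > max_masked]
    return suspect, rest


def mean_reversions(samples):
    """Mean reversion count per sample, or 0.0 for an empty group."""
    if not samples:
        return 0.0
    return sum(s["reversions"] for s in samples) / len(samples)


# Made-up records, purely to exercise the functions.
toy_samples = [
    {"strain": "A", "masked_sites": 0, "reversions": 6},
    {"strain": "B", "masked_sites": 2, "reversions": 4},
    {"strain": "C", "masked_sites": 150, "reversions": 1},
    {"strain": "D", "masked_sites": 90, "reversions": 0},
]
suspect, rest = split_by_masking(toy_samples)
```

Comparing `mean_reversions(suspect)` with `mean_reversions(rest)` would then give a first indication of whether low-masking samples carry excess reversions.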

szhan commented 1 year ago

Or samples that got lots of mutations going to a reference state? Not necessarily only reversions going back to a non-reference state. Also, I suspect that the imputed bases are concentrated in certain regions reported to suffer from amplicon dropouts. As I recall, the ARTIC V1 primer set, which was used to sequence lots of early samples, had quite a few documented dropouts (not just in Spike).

hyanwong commented 1 year ago

Yes, any mutations back to the reference state.
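One possible way to tally this per sample, as a sketch only: count every mutation whose derived state equals the reference base at that position. The `(sample_id, position, derived_state)` tuples and the position-to-base `reference` mapping are assumed input shapes, not something sc2ts provides in this form.

```python
from collections import Counter


def count_to_reference(mutations, reference):
    """Count, per sample, mutations whose derived state is the reference base.

    mutations: iterable of (sample_id, position, derived_state)
    reference: dict mapping position -> reference base (hypothetical layout)
    """
    counts = Counter()
    for sample_id, position, derived in mutations:
        # Positions missing from the reference mapping never match.
        if reference.get(position) == derived:
            counts[sample_id] += 1
    return counts


# Toy input: s1 has two mutations landing on the reference base, s2 has none.
toy_reference = {100: "C", 200: "G"}
toy_mutations = [
    ("s1", 100, "C"),
    ("s1", 200, "G"),
    ("s1", 300, "T"),
    ("s2", 100, "T"),
]
counts = count_to_reference(toy_mutations, toy_reference)
```

Samples with unusually high counts, especially concentrated in known amplicon dropout regions, would be candidates for flagging.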