jeromekelleher / sc2ts-paper


Origins of Alpha #225

Open hyanwong opened 2 days ago

hyanwong commented 2 days ago

This would be worth a line or two in the main text, and some digging for the supplementary. In the "maskreg-psv2-v1-mm_4-f500-clustloc-mrm_2-rw_10-mgs_10-2021-01-28" tree, it looks as if the first of the Alpha mutations resolves to a match against some samples sequenced in Cambridge (suspiciously close, geographically, to Kent, the origin of the Alpha outbreak). In particular, it matches these samples, which are all B.1.1:

{4598: 'ERR4413600', 6226: 'ERR4460507', 7812: 'ERR4460993', 1149: 'ERR4458709', 1150: 'ERR4458827', 8457: 'ERR4461558'}
[Image: Screenshot 2024-10-10 at 16 39 21]

This is so early in the pandemic that we can easily re-run the matching up to that point. @jeromekelleher points out that node 4725 here is likely to be caused by a reversion push, so it is a result of the tree-building part of the sc2ts algorithm. It has 3 exact matches:

```
Node(id=4725, flags=4194304, time=318.250002625, population=-1, individual=-1, metadata={'sc2ts': {'date_added': '2020-03-31', 'num_exact_matches': 3, 'sites': [22909]}})
```

@szhan dug out a paper which also finds a Cambridge sample near the root of the Alpha-defining lineages: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9752794/. We exclude their sample (CAMC-946506 / ERR4638413) because it has too much missing data (`sc2ts.inference Filter ERR4638413: missing=2393 > 500`), but it could be interesting to match it in afterwards.
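
As a rough illustration of the kind of check behind that log line, here is a minimal sketch of a missing-data filter. The function names are hypothetical, and it assumes "missing" means the number of `N` characters in the aligned sequence; the actual sc2ts filter may count missing data differently.

```python
def count_missing(seq: str) -> int:
    """Count missing bases, assumed here to be 'N' characters only."""
    return seq.upper().count("N")


def passes_missing_filter(seq: str, max_missing: int = 500) -> bool:
    """Return True if the sequence has at most max_missing missing bases.

    The default of 500 mirrors the threshold shown in the log line
    (missing=2393 > 500 means the sample is filtered out).
    """
    return count_missing(seq) <= max_missing


# Toy sequences, not real data.
too_sparse = "ACGT" * 100 + "N" * 2393
mostly_complete = "ACGT" * 100 + "N" * 10
print(passes_missing_filter(too_sparse))
print(passes_missing_filter(mostly_complete))
```

Under these assumptions, a sample like ERR4638413 with 2393 missing bases fails the check, while it could still be matched in afterwards with a looser threshold.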

Note that the taxonium tree below matches these 6 samples (circled) quite a long way from the root of alpha (alpha in cyan here). Shing suspects this is because of treating deletions as ancestral states.

[Image: taxonium tree]

Code to replicate

```python
# will need to sub in any changed node IDs here
extra_nodes = [
    u for u in ts.at(21563).children(4725) if ts.node(u).is_sample()
] + [u for u in ts.at(21563).samples(1151)]
svg = ti.draw_subtree(
    tracked_samples=[81660, 43835, 51309],  # Pick a few selected early samples from "B.1.1.7" / alpha
    size=(1100, 2000),
    canvas_size=(1100, 2000),
    time_scale="time",
    extra_tracked_samples=extra_nodes,
    style=".plotbox {transform: translateX(20px)}.leaf > .lab {text-anchor: start; transform: rotate(90deg) translate(6px)}",
)
svg
```

hyanwong commented 2 days ago

That paper also mentions the later MILK-B154B6, GISAID ID: EPI_ISL_2735517 (which I think corresponds to ERR4869224), concluding that it could be a recombinant or result of lab contamination. That also isn't in our dataset (and was presumably filtered out). This would be another interesting sequence to match in, after the fact.

jeromekelleher commented 2 days ago

```
2024-10-10 15:02:22 DEBUG sc2ts.inference Final HMM pass hmm_cost=12.0 ERR4869224 2020-10-23 B.1.1 path=(0:29904, 4462) mutations(12)=[445T>C, 3264C>T, 5986C>T, 6286C>T, 6808T>C, 12247T>C, 15279C>T, 15775A>T, 23604C>A, 23709C>T, 25455G>T, 28977C>T]
```
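
For anyone wanting to work with the mutation list in that log line, here is a small sketch that parses it into tuples. It assumes each entry reads as position, then inherited state, then `>`, then derived state (so `445T>C` is T to C at position 445); `parse_mutation` is a hypothetical helper, not part of sc2ts.

```python
import re

# Assumed format: <position><inherited state>><derived state>, e.g. "445T>C".
MUTATION_RE = re.compile(r"^(\d+)([A-Z-])>([A-Z-])$")


def parse_mutation(text):
    """Parse one mutation string into (position, inherited, derived)."""
    match = MUTATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Unrecognised mutation string: {text!r}")
    position, inherited, derived = match.groups()
    return int(position), inherited, derived


# Mutation list copied from the log line above.
logged = (
    "445T>C, 3264C>T, 5986C>T, 6286C>T, 6808T>C, 12247T>C, "
    "15279C>T, 15775A>T, 23604C>A, 23709C>T, 25455G>T, 28977C>T"
)
mutations = [parse_mutation(m) for m in logged.split(",")]
print(len(mutations), mutations[0])
```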

It's from 2020-10-23, and I think it was just a bit too early for any other Alpha samples to be within the window for inclusion. It would be interesting to see what happens if we match it back in later and add it in.

szhan commented 1 day ago

Looking at the input sequence of ERR4869224, it has a decent number of Ns in these regions:

- ORF1a:6866-7054 (which contains the defining mutation T6954C:I2230T)
- ORF1ab:21428-21458
- S:22339-22523

I wonder how many imputed bases at those sites (based on the best match) would differ from the reference. If all those Ns were actually the same as the reference, we could get quite a different placement in the trees.
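
One way to answer that question could be sketched as below. This is only an illustration: `count_imputed_differences` is a hypothetical helper, the three sequences are assumed to be aligned strings of equal length, and the region coordinates are assumed to be 1-based and inclusive.

```python
# Regions with runs of Ns in ERR4869224, taken from the comment above
# (assumed to be 1-based, inclusive genome coordinates).
N_REGIONS = [(6866, 7054), (21428, 21458), (22339, 22523)]


def count_imputed_differences(sample, imputed, reference, regions):
    """Count sites where the sample is N and the imputed base differs from the reference."""
    n_diff = 0
    for start, end in regions:
        for pos in range(start, end + 1):
            i = pos - 1  # convert 1-based position to 0-based string index
            if sample[i] == "N" and imputed[i] != reference[i]:
                n_diff += 1
    return n_diff


# Toy 10-base example, not real data.
reference = "ACGTACGTAC"
sample = "ACNNACGNAC"
imputed = "ACGAACGTAC"
print(count_imputed_differences(sample, imputed, reference, [(3, 4), (8, 8)]))
```

With real aligned sequences, passing `N_REGIONS` as the last argument would give the number of masked sites where the best match imputes a non-reference base.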