jeromekelleher / sc2ts-paper

3 stars 5 forks source link

Origins of Alpha #225

Open hyanwong opened 1 month ago

hyanwong commented 1 month ago

This would be worth a line or two in the main text, and some digging for the supplementary. In the "maskreg-psv2-v1-mm_4-f500-clustloc-mrm_2-rw_10-mgs_10-2021-01-28" tree, it appears as if we can resolve the first of the Alpha mutations to matching against some samples sequenced in Cambridge (suspiciously geographically close to Kent, the origin of the alpha outbreak) in particular these samples, which are all B.1.1

{4598: 'ERR4413600', 6226: 'ERR4460507', 7812: 'ERR4460993', 1149: 'ERR4458709', 1150: 'ERR4458827', 8457: 'ERR4461558'}
Screenshot 2024-10-10 at 16 39 21

This is so early in the pandemic that we can easily re-run the matching up to that point. @jeromekelleher points out that node 4725 here is likely to be caused by a reversion push, so it is a result of the tree-building part of the sc2ts algorithm. It has 3 exact matches:

Node(id=4725, flags=4194304, time=318.250002625, population=-1, individual=-1, metadata={'sc2ts': {'date_added': '2020-03-31', 'num_exact_matches': 3, 'sites': [22909]}})

@szhan dug out a paper which also finds a Cambridge sample near the root of the alpha-defining lineages: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9752794/. We exclude their sample (CAMC-946506 / ERR4638413) because it has too much missing data: sc2ts.inference Filter ERR4638413: missing=2393 > 500, but it could be interesting to match it in afterwards.

Note that the taxonium tree below matches these 6 samples (circled) quite a long way from the root of alpha (alpha in cyan here). Shing suspects this is because of treating deletions as ancestral states.

tmp

Code to replicate ```python # will need to sub in any changed node IDs here extra_nodes = [u for u in ts.at(21563).children(4725) if ts.node(u).is_sample()] + [u for u in ts.at(21563).samples(1151)] svg = ti.draw_subtree( tracked_samples=[81660, 43835, 51309], # Pick a few selected early samples from "B.1.1.7" / alpha size=(1100, 2000), canvas_size=(1100, 2000), time_scale="time", extra_tracked_samples=extra_nodes, style=".plotbox {transform: translateX(20px)}.leaf > .lab {text-anchor: start; transform: rotate(90deg) translate(6px)}", ) svg ```
hyanwong commented 1 month ago

That paper also mentions the later MILK-B154B6, GISAID ID: EPI_ISL_2735517 (which I think corresponds to ERR4869224), concluding that it could be a recombinant or result of lab contamination. That also isn't in our dataset (and was presumably filtered out). This would be another interesting sequence to match in, after the fact.

jeromekelleher commented 1 month ago

2024-10-10 15:02:22 DEBUG sc2ts.inference Final HMM pass hmm_cost=12.0 ERR4869224 2020-10-23 B.1.1 path=(0:29904, 4462) mutations(12)=[445T>C, 3264C>T, 5986C>T, 6286C>T, 6808T>C, 12247T>C, 15279C>T, 15775A>T, 23604C>A, 23709C>T, 25455G>T, 28977C>T]

It's from 2020-10-23, and I think it just was a bit too early to gather up any other Alpha samples be within the window for inclusion. Would be interesting to see what would happen from matching it back later and adding it in all right.

szhan commented 1 month ago

Looking at the input sequence of ERR4869224, which has a decent number of Ns in these regions. ORF1a:6866-7054 (which contains defining mutation T6954C:I2230T) ORF1ab:21428-21458 S:22339-22523

I wonder how many imputed bases in at those sites (based on the best match) would be different than the ref. If all those Ns were supposed to be the same as ref., then we could have quite different placement in the trees.

hyanwong commented 1 month ago

The all-samples ARG is showing much the same patterns for Alpha, with a GAT->CTA triplet at the base of B.1.1.7 (cyan samples). We have removed the deletions here, but there are still lots of alpha-specific mutations, most of which are unique.

I'm suspicious about the C3267T recurrent mutation just below the alpha root (node 83297).

Screenshot 2024-10-21 at 15 34 08
jeromekelleher commented 1 month ago

I'm suspicious about the C3267T recurrent mutation just below the alpha root (node 83297).

Yes, this is odd, good spot. Let's keep an eye out for this on the next (final??) iteration. I'm suspicious that the HMM current settings will lead to slightly suboptimal HMM matches for really unlikely matches, and it may look something like this.

hyanwong commented 1 month ago

It would be nice to be able to show, graphically, where the two "intermediate" samples (ERR4869224 and ERR4638413) which we excluded fit onto the tree, and how they would shuffle the order of mutations on that long branch. The routine that @jeromekelleher coded at https://github.com/jeromekelleher/sc2ts/pull/360 should do most of it, but maybe not the mutation pushing etc required to change the mutation order.

I assume we'd either match these sequences to the top or bottom node of that long branch, and then change the mutation ordering using reversion pushes, etc.

jeromekelleher commented 1 month ago

If you run the match and post the results we can figure out what inserting would do.