Open hyanwong opened 1 month ago
Note that SRR11810706 is from Gujarat, and we know that Delta originated in India, so this might be believable.
The newer all-sample ARG, as of this morning ("maskdel-v1-mm_4-f_500-mrm_2-mms_5-mrec_2-rw_7-mgs_10-2021-08-28"), has no major recombination problems, but has a different problem with the start of Delta. In particular, it infers 2 separate Delta origins, one comprising about 2/3rds of the AY.* lineages plus B.1.617, B.1.617.1, and B.1.617.2, the other comprising the remaining 1/3rd or so of the AYs. There are loads of parallel mutations on the stems leading to each clade. This is clearly wrong, and we should try to figure out why, and check that it doesn't happen in future ARGs.
@jeromekelleher dug into the logs and saw two retro groups being added on the same day, which could be the source of this:
2024-10-19 07:01:34 WARNING sc2ts.inference Add retro group {'B.1.617.2': 21, 'AY.38': 2, 'AY.9': 2}: samples=25 depth=3 total_muts=53 root_muts=11 muts_per_sample=2.12 recurrent_muts=1
2024-10-19 07:01:34 WARNING sc2ts.inference Add retro group {'B.1.617.2': 9, 'AY.122': 1, 'AY.1': 1}: samples=11 depth=2 total_muts=40 root_muts=14 muts_per_sample=3.6363636363636362 recurrent_muts=0
Alternatively, it could be some of the tweaked HMM parameters.
Here's a plot subset down to about 30 AY.4 samples (cyan), one of which is an outlier and groups under a recombination node on the far right. The others are all in the 1/3rd clade, which is independently picking up the same mutations that lead to the bulk of the delta-origin "B.1.617.2" samples (in orange, below):
Here's the main plot of all AY lineages, with Delta (bottom right) showing 3 independent origins (urgh):
An important thing to note here is that these two retro groups are clearly a mix of time-travelling lineages. Most retro groups consist of just one pango lineage (indicating that we're picking up the origin of that lineage). A mixture of lineages indicates potential problems. A mixture of highly distinct lineages across many months (as here) indicates time travel and big trouble.
Good point about a mix of lineages. In real time we might not have the lineage information for new samples (lineages may not have been devised yet), but in that case we shouldn't have so many time travelling problems either.
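As a rough illustration of the mixed-lineage check discussed above, here is a minimal sketch that parses the two "Add retro group" log lines quoted earlier in this thread and flags any group containing more than one pango lineage. The function names (`parse_retro_group`, `is_mixed`) are hypothetical and not part of sc2ts; the regular expression assumes the log format is exactly as quoted.

```python
import ast
import re

# Matches the "Add retro group {...}: key=value ..." lines quoted in this thread.
RETRO_RE = re.compile(r"Add retro group (\{.*?\}): (.*)$")


def parse_retro_group(line):
    """Return (lineage_counts, stats) for one log line, or None if no match."""
    match = RETRO_RE.search(line)
    if match is None:
        return None
    # The lineage counts are printed as a Python dict literal.
    counts = ast.literal_eval(match.group(1))
    # The rest of the line is a run of space-separated key=value pairs.
    stats = {}
    for item in match.group(2).split():
        key, value = item.split("=")
        stats[key] = float(value)
    return counts, stats


def is_mixed(counts):
    """Treat a retro group as 'mixed' if it holds more than one pango lineage."""
    return len(counts) > 1


lines = [
    "2024-10-19 07:01:34 WARNING sc2ts.inference Add retro group "
    "{'B.1.617.2': 21, 'AY.38': 2, 'AY.9': 2}: samples=25 depth=3 "
    "total_muts=53 root_muts=11 muts_per_sample=2.12 recurrent_muts=1",
    "2024-10-19 07:01:34 WARNING sc2ts.inference Add retro group "
    "{'B.1.617.2': 9, 'AY.122': 1, 'AY.1': 1}: samples=11 depth=2 "
    "total_muts=40 root_muts=14 muts_per_sample=3.6363636363636362 "
    "recurrent_muts=0",
]

for line in lines:
    counts, stats = parse_retro_group(line)
    flag = "MIXED" if is_mixed(counts) else "ok"
    print(flag, sorted(counts), int(stats["samples"]))
```

Both quoted groups come out as mixed. A stricter check would also compare the sample dates within each group, but those are not in the log lines, so that part is left out here.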
The seeding method, using strain XXXX, seems to give OK results: at least, it creates only one tree (no recombination), has no reversions and few recurrent mutations, and doesn't lead to multiple origins for delta.
In this test ARG (delta_wave_seeded_v3_hmm_cost_7-2021-06-07.ts.tsz) there is a single sample, ERR5965862 (node 156987), which comes off first, but is about 60 days later than the delta node:
ERR5965862 is separated from the root of all other deltas, 142724, by a single mutation, A11201G, which is not a reversion or anything, so maybe that's OK? Perhaps it's a sample worth looking at (e.g. can we map it to GISAID and get a sample submission date?).
The nodes under 142724 show recurrent mutations (G21987A (1/5) and C21846T (1/6)), which seem a bit suss to me. It could be worth looking into what's going on there, and whether different seed samples would change anything.
Thanks Yan, super helpful. The seed sample here was ERR5876690.
ERR5965862 happens to be one of the Delta samples that arrived soon after the seed sample was added, and it matched to it with 2 mutations [T10651C, G11201A]. G11201A is an immediate reversion, and a reversion push node was therefore created which became the ultimate "Delta node".
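To make the "immediate reversion" idea concrete, here is a small sketch that parses mutation labels of the form used in this thread (e.g. `A11201G`) and checks whether a child mutation exactly undoes a parent mutation at the same site. The helper names are hypothetical, and this is only an illustration of the definition, not how sc2ts detects reversions internally.

```python
import re
from typing import NamedTuple

# Labels look like "A11201G": inherited base, 1-based position, derived base.
MUTATION_RE = re.compile(r"^([ACGT])(\d+)([ACGT])$")


class Mutation(NamedTuple):
    inherited: str
    position: int
    derived: str


def parse_mutation(label):
    """Split a label such as 'A11201G' into its three parts."""
    match = MUTATION_RE.match(label)
    if match is None:
        raise ValueError(f"Unrecognised mutation label: {label!r}")
    return Mutation(match.group(1), int(match.group(2)), match.group(3))


def is_immediate_reversion(parent_label, child_label):
    """True if the child mutation restores the state the parent changed."""
    parent = parse_mutation(parent_label)
    child = parse_mutation(child_label)
    return (
        parent.position == child.position
        and parent.inherited == child.derived
        and parent.derived == child.inherited
    )


print(is_immediate_reversion("A11201G", "G11201A"))
print(is_immediate_reversion("A11201G", "T10651C"))
```

The first call pairs the two site-11201 mutations mentioned in this thread; the second pairs mutations at different sites, so it cannot be a reversion.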
The single side shoot branch here is just a function of chance I think, and odd stuff like that's going to happen. The recurrent mutations aren't brilliant, but I think we can live with that.
It really is quite hard to find a good starting point with all the noise, so unless there's something badly wrong with this proposal I think we should stick with it.
Here's a useful paper talking about Delta sequences in the UK: https://www.nature.com/articles/s41586-022-05200-3
We are trying to find a decent Delta seed, that is not a time traveller. I see an article that mentions the earliest Delta in GISAID is on 5th Oct 2020 (see https://www.sciencemediacentre.org/expert-reaction-to-cases-of-variant-b-1-617-the-indian-variant-being-investigated-in-the-uk/). @szhan: would it be possible to locate the GISAID submission that is discussed in that article? It seems to be from Maharashtra state.
This paper uses EPI_ISL_1360382, but that's from 2021, I think.
It also says "Most isolates sequenced by India originated from Maharashtra and West Bengal, but B.1.617 has been identified in several other states.", so we could potentially find other believable seeds by looking for submissions in mid-October from those states?
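A sketch of how such a search might be scripted, assuming we had sample metadata as a list of dicts. The field names (`sample`, `state`, `date`), the records themselves, and the exact date window are all made up for illustration; real GISAID or ENA metadata will be laid out differently.

```python
from datetime import date


def candidate_seeds(records, states, start, end):
    """Return records from the given states collected within [start, end]."""
    hits = []
    for record in records:
        if record["state"] not in states:
            continue
        try:
            collected = date.fromisoformat(record["date"])
        except ValueError:
            # Skip incomplete dates such as "2020-10"; they can't be trusted here.
            continue
        if start <= collected <= end:
            hits.append(record)
    # Earliest first, since we want a seed that is not a time traveller.
    return sorted(hits, key=lambda r: r["date"])


# Entirely hypothetical records, only to show the shape of the input.
records = [
    {"sample": "hypothetical_1", "state": "Maharashtra", "date": "2020-10-20"},
    {"sample": "hypothetical_2", "state": "Maharashtra", "date": "2021-03-02"},
    {"sample": "hypothetical_3", "state": "West Bengal", "date": "2020-10-12"},
    {"sample": "hypothetical_4", "state": "Kerala", "date": "2020-10-15"},
    {"sample": "hypothetical_5", "state": "West Bengal", "date": "2020-10"},
]

seeds = candidate_seeds(
    records,
    states={"Maharashtra", "West Bengal"},
    start=date(2020, 10, 10),
    end=date(2020, 10, 25),
)
print([r["sample"] for r in seeds])
```

Any hits would still need checking by eye (e.g. for Delta-defining mutations) before being used as a seed.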
This could be useful, from https://pubmed.ncbi.nlm.nih.gov/33961693/. The preprint (https://www.biorxiv.org/content/10.1101/2021.04.23.441101v1.full.pdf) sounds like it could be helpful.
In the ARG as of 13th Oct 2024, at the start of the spike there is a single clade that comprises the delta samples ("B.1.617.2" or "AY.*"). It contains all 43028 delta samples plus two B.1-assigned samples, SRR23110826 (node 17569) and SRR11810706 (node 12805), both of which seem to be good candidates for samples close to the origin of delta (orange, below).
However, the origins of delta are a little messy, with reversions (in magenta) crammed into a short branch below node 140425 (expanded below), and several separate lineages. We think this is probably because of adding different retrospective groups from different countries. We should check whether this is improved in the next ARG iteration.
Note that node 140425 and its sister 147202 are recombination nodes (drawn with a larger open circle).
For huge clades like this, it can be helpful to subsample a few delta nodes, as per below
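Separately from the plot, one way to pick such a subsample reproducibly is sketched here: keep a few nodes of interest and fill the rest with a seeded random draw. The function name `subsample_nodes` and the range of "delta" node IDs are placeholders; only 12805 and 17569 are node IDs taken from this thread.

```python
import random


def subsample_nodes(node_ids, k, keep=(), seed=1):
    """Pick up to k node IDs: everything in `keep`, plus a random fill."""
    all_ids = set(node_ids)
    # Only keep nodes that are actually present in the clade.
    kept = sorted(set(keep) & all_ids)
    pool = sorted(all_ids - set(kept))
    # A private, seeded generator makes the same subsample every run.
    rng = random.Random(seed)
    n_random = max(0, min(k - len(kept), len(pool)))
    return sorted(kept + rng.sample(pool, n_random))


# Placeholder IDs standing in for the delta sample nodes.
delta_nodes = list(range(200000, 200500)) + [12805, 17569]

subset = subsample_nodes(delta_nodes, k=30, keep=[12805, 17569])
print(len(subset), subset[:2])
```

The resulting list of node IDs could then be passed to whatever plotting or simplification step is being used, so that the same handful of samples shows up each time the figure is regenerated.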