jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License
5 stars 3 forks source link

Update problematic sites for viridian data #231

Closed jeromekelleher closed 1 month ago

jeromekelleher commented 3 months ago

We currently have sites that are being masked for all samples (going by the max_masked_samples_per_site values). Investigate, and update to include any sites that are always masked under the protocol.

There will likely be some sites that also need to be added because of #228

jeromekelleher commented 2 months ago

Based on the Viridian ARG inferred up to 2021-06-30 (same date as the Wide ARG from the preprint) we have the following data:

site-outliers

That is, there are 15 sites with more than 500 mutations at this time point, and these account for 40% of the total reversion mutations and 25% of the immediate reversions. I'm going to try running these ARG inference now with these sites excluded.

jeromekelleher commented 2 months ago

15 sites is 0.05% of the total currently used, so I wouldn't fret too much about losing data here.

szhan commented 1 month ago

I'm digging a bit deeper into Bloom et al.'s analysis of mutational spectra. They excluded some sites that experience high mutation frequencies, some of which are not in our current list (commit: 28d4231de54ebeebd5520587166e71f03268f12f). Also, some of them are clade-specific.

They explained that these are probably enriched for sequencing or bioinformatics errors (e.g. imputed to the reference). None of these sites seem to be in the labelled ones above, so they were most likely cleaned up in the Viridian dataset.

Nothing to do with them now, but I'm just tagging the list here in case one of the sites crop up in our newer ARGs (see sites_to_exclude). https://github.com/jbloomlab/SARS2-mut-spectrum/blob/main/config.yaml

jeromekelleher commented 1 month ago

Note I didn't actually update the set of problematic sites, but am using an additional list provided to the CLI. These are

56
57
58
59
60
7851
10323
11750
17040
21137
21846
22917
22995
26681
27384
27638
27752
28254
28271
29614

I got rid of 56-61 because they were masked out in a very large fraction of the alignments.

szhan commented 1 month ago

Just to follow up on this. These sites are in the Bloom list but in the our current problematic site list or the list of additional sites above.

1149
5629
6851
7328
13947
28095
29362
jeromekelleher commented 1 month ago

Do they show up in our QC plots, as either highly mutated or masked?

szhan commented 1 month ago

From Slack, in case it gets lost.

Just noticed that four of the additional sites (the "mutation outliers") are where lineage defining mutations for Delta (B.1.617.2) in the Freya occur. Three of them are found in a bunch of other variants' lineage defining mutations. But one of them (27638) seems to occur mostly in Delta, Delta subvariants, and closely related variants (B.1.617, B.1.617.1, B.1.617.3).
szhan commented 1 month ago

Keen to reexamine all these sites again, but alongside the number of deleted sites per sample.

szhan commented 1 month ago

Including these two regions to the list of additional problematic sites.

Region: NTD domain Coords: [21602-22472) Multiple highly recurrent deleted regions in NTD domain in Spike https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7971772/

Region: ORF8 Coords: [27939-28257) Interesting things happen, e.g. truncation mutations, out-of-frame deletions, and even loss of the entire gene. Also, an analysis of recurrently deleted sites in clipends-v2-mm_4-noexact-f500-clustloc-mrm_2-rw_10-mgs_10-2021-04-10.ts.tsz shows that site 28254 has 143 1-bp private deletion events. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7374062/ https://virological.org/t/repeated-loss-of-orf8-expression-in-circulating-sars-cov-2-lineages/931/1

jeromekelleher commented 1 month ago

For ORF8 the genetic map we have says it covers coords [27894, 28260), but we have ORF8: [27939-28257) here.

Which should we use?

szhan commented 1 month ago

Let's go with [27894, 28260) to be consistent, since we are sticking with the current coordinates.

[27939-28257) is from Swiss-Prot, and it differs by a stop codon.