sars-cov-2-variants / lineage-proposals

Repository to propose and discuss lineages

KP.3 + S:T22I, ORF3a:V77T (3-nuc) (55 seq, Jun 28) #1610

Open ryhisner opened 3 weeks ago

ryhisner commented 3 weeks ago

Description

Sub-lineage of: KP.3
Earliest sequence: 2024-03-09, Tasmania (EPI_ISL_19140364)
Most recent sequence: 2024-06-16, Canada, Alberta (EPI_ISL_19222338)
Continents circulating: North America (28), Oceania (27)
Countries circulating: Canada (25), Australia (19), New Zealand (8), USA (3)
Number of Sequences: 55
GISAID Nucleotide Query: T25620C, G25621A, T25622C, -T25119A
CovSpectrum Query: [3-of: T25620C, G25621A, T25622C] & !T25119A
Substitutions on top of KP.3:
Spike: T22I
ORF3a: V77T
Nucleotide: C21627T, T25620C, G25621A, T25622C, G29554T (3' UTR)
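For anyone who wants to check a list of called mutations against the CovSpectrum query above offline, here is a minimal Python sketch. The function name `matches_proposal` and the three example sequences are hypothetical; only the mutation names come from the query, and this is not how CovSpectrum itself evaluates queries.

```python
# Mutations taken from the query: [3-of: T25620C, G25621A, T25622C] & !T25119A
DEFINING = {"T25620C", "G25621A", "T25622C"}
EXCLUDED = "T25119A"


def matches_proposal(mutations, min_hits=3):
    """Return True if the sequence carries at least `min_hits` of the
    defining mutations and does not carry the excluded mutation."""
    muts = set(mutations)
    return len(DEFINING & muts) >= min_hits and EXCLUDED not in muts


# Hypothetical mutation lists, for illustration only.
examples = {
    "seq_a": ["C21627T", "T25620C", "G25621A", "T25622C", "G29554T"],
    "seq_b": ["T25620C", "G25621A", "T25622C", "T25119A"],
    "seq_c": ["C21627T", "T25620C"],
}

for name, muts in examples.items():
    print(name, matches_proposal(muts))
```

Only `seq_a` passes: `seq_b` carries the excluded T25119A and `seq_c` has just one of the three defining substitutions.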

Phylogenetic Order of Mutations: All at once

USHER Tree https://nextstrain.org/fetch/raw.githubusercontent.com/ryhisner/jsons2/main/KP.3_ORF3a.V77T__S.T22I.json?c=gt-ORF3a_77&gmax=26220&gmin=25393&label=id:node_5611311

image

Evidence

I don't think this is a particularly fast lineage. The first sequence appeared on March 9 (the first KP.3 ever in Australia), so it was present from very early on. But the 3-nucleotide mutation in ORF3a is interesting.

image

The first thing I think of when I see a multi-nucleotide mutation is a recombination event that forms a TRS or TRS-like motif. Probably a majority of multi-nuc mutations are of this kind. This one doesn't resemble the TRS-L in the slightest, however. The next thing I look for is whether it was obtained from another portion of the viral genome, presumably through some sort of recombination. But the sequence this 3-nuc mutation forms has no match, and no reverse-complement match, anywhere else in the SARS-CoV-2 genome.
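The kind of search described above (look for a motif and its reverse complement elsewhere in the genome) can be sketched in a few lines of Python. This is only an illustration: `toy_genome` is a made-up string, not SARS-CoV-2 sequence, and the 8-nt motif CACTCACT is just one possible choice of "the sequence the mutation forms".

```python
def reverse_complement(seq):
    """Reverse complement of a DNA-alphabet sequence (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq.upper()))


def find_all(genome, motif):
    """0-based start positions of every occurrence, overlaps included."""
    hits = []
    start = genome.find(motif)
    while start != -1:
        hits.append(start)
        start = genome.find(motif, start + 1)
    return hits


def search_both_strands(genome, motif):
    """Search for the motif itself and for its reverse complement."""
    genome, motif = genome.upper(), motif.upper()
    return {
        "forward": find_all(genome, motif),
        "reverse_complement": find_all(genome, reverse_complement(motif)),
    }


# Made-up toy string, NOT real genome sequence.
toy_genome = "ATGCACTCACTGGAGTGAGTGTT"
print(search_both_strands(toy_genome, "CACTCACT"))
# {'forward': [3], 'reverse_complement': [13]}
```

With a real reference loaded as a string, any hit other than the mutated site itself would point to a possible donor region.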

The last thing I usually look for is RdRp "stutters," which more often result in insertions but can sometimes cause a multi-nuc mutation as well. Usually these occur where there is already a repeat trinucleotide sequence—like TATTAT or CAACAA, for example. The most common mutation in these cases is a 3-nuc insertion that creates a 3rd repeat. We see this in the Q-dense region of N, specifically at N:239-242, where four consecutive Q's form CAACAACAACAA; in ORF1b:824-825, where the double-D GATGAT occasionally results in another GAT repeat with ORF1b:ins823D (as in XBC); in NSP2, and elsewhere.
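To make the "repeat trinucleotide" idea concrete, here is a small Python sketch that finds back-to-back copies of a unit of fixed length. The helper name `tandem_repeats` is hypothetical, and the input strings are short snippets built around the repeats quoted above (CAACAACAACAA, GATGAT), not extracts from a reference genome.

```python
import re


def tandem_repeats(seq, unit_len=3, min_copies=2):
    """Return (0-based start, unit, copies) for each run in which a unit of
    `unit_len` bases is repeated at least `min_copies` times in a row."""
    # ([ACGT]{n}) captures one unit; \1{k,} requires k or more further copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)) // unit_len)
        for m in pattern.finditer(seq.upper())
    ]


print(tandem_repeats("TTCAACAACAACAAGG"))  # [(2, 'CAA', 4)]
print(tandem_repeats("GATGAT"))            # [(0, 'GAT', 2)]
print(tandem_repeats("GATGATGAT"))         # [(0, 'GAT', 3)]
```

Note that this simple pattern also reports homopolymer runs (AAAAAA counts as two copies of AAA), so real use would probably need a filter for those.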

image

That's what seems to have happened here, but it's unusual for a couple of reasons. First, it's a 4-nucleotide repeat rather than the usual 3. Second, there weren't any repeats there to begin with. Across ORF3a:78-79 (25624-25627), there is a CACT sequence. This seems to have been copied onto the four nucs directly upstream (25620-25623) without causing an insertion; those four nucs (the next ones in line, since the RdRp copies the positive strand "backwards") were simply overwritten.
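The coordinates in the paragraph above can be checked with a little arithmetic. The sketch below assumes ORF3a starts at nucleotide 25393 in the Wuhan-Hu-1 reference (the same value as `gmin` in the tree link). The reference bases T, G, T at 25620-25622 come from the mutation names and CACT at 25624-25627 from the text; the T at 25623 is an inference (it is not listed as mutated and the result is described as a CACT copy). Function names are hypothetical.

```python
ORF3A_START = 25393  # assumed 1-based start of ORF3a in Wuhan-Hu-1


def codon_to_nucs(codon, orf_start=ORF3A_START):
    """1-based (first, last) nucleotide positions of an ORF codon."""
    first = orf_start + 3 * (codon - 1)
    return first, first + 2


def nuc_to_codon(pos, orf_start=ORF3A_START):
    """ORF codon number that contains a 1-based nucleotide position."""
    return (pos - orf_start) // 3 + 1


def apply_subs(ref, subs):
    """Apply substitutions like 'T25620C' to a {position: base} mapping and
    return the resulting bases as a string in position order."""
    out = dict(ref)
    for s in subs:
        old, pos, new = s[0], int(s[1:-1]), s[-1]
        if out[pos] != old:
            raise ValueError(f"reference mismatch at {pos}")
        out[pos] = new
    return "".join(out[p] for p in sorted(out))


# Positions 25620-25627; the base at 25623 is inferred, see lead-in.
ref = dict(zip(range(25620, 25628), "TGTTCACT"))
subs = ["T25620C", "G25621A", "T25622C"]

print(codon_to_nucs(77))      # (25621, 25623)
print(nuc_to_codon(25620))    # 76
print(apply_subs(ref, subs))  # CACTCACT
```

Under these assumptions, G25621A and T25622C fall in codon 77 (GTT to ACT, i.e. V77T), T25620C is the third base of codon 76, and the eight bases at 25620-25627 end up as two tandem copies of CACT.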

Notably, there's a three-sequence KP.3.3 branch that also has a multi-nuc mutation very near this one: T25629C, G25630C. Somewhat further downstream, a few LD.1 sequences have C25688T, T25689C.

image

ORF3a:103 has always been a hotspot for insertions as well (always CCC or CCCCC). I'm guessing there's something about the RNA secondary structure of this region that makes it liable to these sorts of RdRp errors. It looks like a very "bubbly" region, with lots of unpaired and weakly paired nucleotides.

image

Also, ORF3a is pretty tolerant of mutations.

Genomes

Genomes EPI_ISL_19059053, EPI_ISL_19085239, EPI_ISL_19108010, EPI_ISL_19108072, EPI_ISL_19108078, EPI_ISL_19140364, EPI_ISL_19142246, EPI_ISL_19142249, EPI_ISL_19142360, EPI_ISL_19161640, EPI_ISL_19161699, EPI_ISL_19161746, EPI_ISL_19163166, EPI_ISL_19163296, EPI_ISL_19180455, EPI_ISL_19180457, EPI_ISL_19185024, EPI_ISL_19185765, EPI_ISL_19186259, EPI_ISL_19186864, EPI_ISL_19186996, EPI_ISL_19186998, EPI_ISL_19187204, EPI_ISL_19187475, EPI_ISL_19188022, EPI_ISL_19192878-19192879, EPI_ISL_19192912, EPI_ISL_19203077, EPI_ISL_19203091, EPI_ISL_19203260, EPI_ISL_19203264, EPI_ISL_19203327, EPI_ISL_19203329, EPI_ISL_19209614, EPI_ISL_19215280, EPI_ISL_19217906, EPI_ISL_19222072, EPI_ISL_19222098, EPI_ISL_19222131, EPI_ISL_19222140, EPI_ISL_19222168, EPI_ISL_19222183-19222184, EPI_ISL_19222186-19222187, EPI_ISL_19222194, EPI_ISL_19222205, EPI_ISL_19222295-19222297, EPI_ISL_19222309, EPI_ISL_19222318, EPI_ISL_19222338, EPI_ISL_19223774
ryhisner commented 3 weeks ago

Not related to this lineage, but relevant to ORF3a mutations/insertions in general: I noticed four KP.3 sequences uploaded today from France that have both ORF3a:L106F and ORF3a:ins103P. I mentioned above that ORF3a:ins103P (called as ins25701_CCC by Nextclade) is a very common insertion. There are only four C's in a row here, something that occurs 12 times elsewhere in the genome (with three of those having five consecutive C's), so at first it seems confusing why this particular stretch of 4 C's should lead to so many RdRp-stutter CCC insertions. But I'm almost certain it has to do with the A-T-rich stretch just downstream of the 25700-25703 "CCCC" motif: it extends from 25704 to 25719, and 13 of its 16 nucleotides are either A or T.
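Counting how often a run of four or more C's occurs is easy to reproduce on a reference sequence. Below is a minimal Python sketch; `c_runs` is a hypothetical helper and `toy` is a made-up string, so the printed positions say nothing about the real genome.

```python
import re


def c_runs(genome, min_len=4):
    """Return (1-based start, run length) for every maximal run of C
    that is at least `min_len` bases long."""
    return [
        (m.start() + 1, len(m.group(0)))
        for m in re.finditer(r"C{%d,}" % min_len, genome.upper())
    ]


# Made-up toy string, NOT real genome sequence.
toy = "ATCCCCGTACCCTTCCCCCAG"
print(c_runs(toy))  # [(3, 4), (15, 5)]
```

Run against the full reference, the length of the returned list would give the number of such stretches, and filtering on run length would separate the 4-C runs from the 5-C runs.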

A-T-dense stretches of nucleotides pair more weakly with matching nucs (whether in the secondary RNA structure or in double-stranded RNA), which is why they're known as being "slippery," usually in the context of translation. A shorter stretch of A-T's is partially responsible for the famous frameshifting element (FSE) at the ORF1a/ORF1b boundary, which causes ribosomes to slip and change frames in a minority of translation passes. In the case of these ORF3a:ins103P insertions, I suspect RdRp-stuttering is a more likely explanation than the ribosome slipping by three (or occasionally six) nucleotides, though I'm not knowledgeable enough on this topic to say for certain.

image

At first I considered this idea rather speculative, but then I noticed something: sequences that have ORF3a:L106F (C25708T) have a hugely disproportionate share of ORF3a:ins103P insertions.

image

Because C25708T increases the A-T content of the 16 nucleotides downstream of the CCCC motif from 13/16 to 14/16, this is exactly what you would expect if it were the slipperiness of those 16 nucleotides causing the insertions.
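The 13/16 to 14/16 arithmetic can be illustrated as follows. Important caveat: `toy_window` is a placeholder with the right length and A/T count and a C at the position corresponding to 25708; it is NOT the real reference sequence at 25704-25719, which should be substituted in for any actual check. Function names are hypothetical.

```python
WINDOW_START = 25704  # first position of the 16-nt window (from the text)

# Placeholder only: 16 nt, 13 of them A/T, with a C at offset 4 (= 25708).
toy_window = "TAATCTTATGAATCAT"


def at_count(seq):
    """Number of A or T bases in the sequence."""
    return sum(base in "AT" for base in seq.upper())


def apply_sub(window, sub, window_start=WINDOW_START):
    """Apply one substitution such as 'C25708T' to the window string."""
    old, pos, new = sub[0], int(sub[1:-1]), sub[-1]
    i = pos - window_start
    if window[i] != old:
        raise ValueError(f"window has {window[i]} at {pos}, expected {old}")
    return window[:i] + new + window[i + 1:]


mutated = apply_sub(toy_window, "C25708T")
print(at_count(toy_window), "/", len(toy_window))  # 13 / 16
print(at_count(mutated), "/", len(mutated))        # 14 / 16
```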

The first comparison is between sequences that have the ins_25701:CCC insertion and those that don't. C25708T is more than 16-fold overrepresented in the sequences with the insertion.

image

The second comparison is between sequences with and without C25708T. In sequences with C25708T, ins_25701:CCC frequency is increased ~24-fold compared to sequences without C25708T.

image
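For readers who want to reproduce this kind of comparison, here is a minimal Python sketch of the two fold-change calculations from a single 2x2 table. All four counts are hypothetical, chosen only to make the arithmetic easy to follow; they are not the numbers behind the screenshots, and the function name is made up. The point it illustrates is that the two ratios condition in opposite directions, so they need not be equal.

```python
def fold_change(hits_a, total_a, hits_b, total_b):
    """Frequency in group A divided by frequency in group B."""
    return (hits_a / total_a) / (hits_b / total_b)


# Hypothetical 2x2 table (sequence counts), for illustration only.
ins_and_mut = 60         # insertion present, C25708T present
ins_no_mut = 500         # insertion present, C25708T absent
no_ins_and_mut = 4_940   # insertion absent,  C25708T present
no_ins_no_mut = 999_500  # insertion absent,  C25708T absent

# Comparison 1: how overrepresented is C25708T among sequences
# with the insertion, relative to sequences without it?
mut_enrichment = fold_change(
    ins_and_mut, ins_and_mut + ins_no_mut,
    no_ins_and_mut, no_ins_and_mut + no_ins_no_mut,
)

# Comparison 2: how much more frequent is the insertion in sequences
# with C25708T than in sequences without it?
ins_enrichment = fold_change(
    ins_and_mut, ins_and_mut + no_ins_and_mut,
    ins_no_mut, ins_no_mut + no_ins_no_mut,
)

print(round(mut_enrichment, 1), round(ins_enrichment, 1))
```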
FedeGueli commented 3 weeks ago

Thanks Ryan for the analysis here.