sars-cov-2-variants / lineage-proposals

Repository to propose and discuss lineages

Suggestion to mask S:31 (21653-21655) for KP.3.1.1 (and other S:31del lineages) #1808

Open aviczhl2 opened 1 month ago

aviczhl2 commented 1 month ago

There seems to be a large KP.3.1.1+S:S31F branch, even though KP.3.1.1 should have S:31del.

That branch is driven by Danish seqs, whose pipeline does not handle S:S31del well. When querying C12616T, A13121T, C21654T, all seqs are from Denmark, so this is clearly an artifact.

However, that branch now seems to attract seqs with no coverage at the S:31 positions, forming a very large artifact S:S31F branch under KP.3.1.1.

@AngieHinrichs, I suggest masking the deleted part of KP.3.1.1 (and other S:31del lineages) on Usher.

https://nextstrain.org/fetch/genome-test.gi.ucsc.edu/trash/ct/subtreeAuspice20_genome_test_32f78_b095f0.json?label=id:node_5486714

FedeGueli commented 1 month ago

I'm not sure it is a good idea: in case of recombination we won't see it. We know that it is an artifact, but let's see how @corneliusroemer and @angiehinrichs want to handle this.

aviczhl2 commented 1 month ago

> I'm not sure it is a good idea: in case of recombination we won't see it. We know that it is an artifact, but let's see how @corneliusroemer and @AngieHinrichs want to handle this.

The problem is that I see a lot of further lineages (KP.3.1.1+Spike) being placed under this, messing up the tree.

aviczhl2 commented 1 month ago

This problem also appears in KP.2.15 and LB.1.2 sub-trees now.

https://github.com/cov-lineages/pango-designation/issues/2711 https://github.com/cov-lineages/pango-designation/issues/2712

AngieHinrichs commented 1 month ago

C21654T is the (or a) defining mutation of a bunch of JN.1 sublineages:

If all of those are artefacts (?), then the lineages JN.1.20, JN.1.58.3, JN.2.1, KP.3.1.3, MK.1, and JN.1.1.7 (on big C11747T polytomy) need to be retracted because they have no other defining mutation.

grep S:.31F ~/github/pango-designation/lineage_notes.txt

JN.1.1.7        Alias of B.1.1.529.2.86.1.1.1.7, S:S31F, from sars-cov-2-variants/lineage-proposals#1157
KP.2.7  Alias of B.1.1.529.2.86.1.1.11.1.2.7, S:S31F, after T22795G
KP.2.14 Alias of B.1.1.529.2.86.1.1.11.1.2.14, S:G184V, S:S31F, from sars-cov-2-variants/lineage-proposals#1578
KP.3.1.3        Alias of B.1.1.529.2.86.1.1.11.1.3.1.3, S:S31F,from sars-cov-2-variants/lineage-proposals#1576
MK.1    Alias of B.1.1.529.2.86.1.1.11.1.3.1.6.1, S:S31F, UK
KP.3.4.1        Alias of B.1.1.529.2.86.1.1.11.1.3.4.1, S:S31F, N:A134V, ORF1a:E1192K
KP.4.2.1        Alias of B.1.1.529.2.86.1.1.11.1.4.2.1, S:S31F
JN.1.20 Alias of B.1.1.529.2.86.1.1.20, S:S31F, directly on JN.1 polytomy
MB.1    Alias of B.1.1.529.2.86.1.1.49.1.1, S:F456V (T22928G), S:S31F
JN.1.58.3       Alias of B.1.1.529.2.86.1.1.58.3, S:S31F
JN.2.1  Alias of B.1.1.529.2.86.1.2.1, S:S31F, Sweden/Australia

AngieHinrichs commented 1 month ago

S:31del (or S:S31del) is listed for several JN.1 descendant lineages, and I see that includes KP.3.1.1, but as far as I can see they don't seem to be ancestors or siblings of the lineages with S:S31F above (except for KP.2.14 above & KP.2.15 below, and KP.3.1.3 above & KP.3.1.1 below).

grep 31del ~/github/pango-designation/lineage_notes.txt

KP.1.1.3        Alias of B.1.1.529.2.86.1.1.11.1.1.1.3, C4999T, many with S:S31del, from sars-cov-2-variants/lineage-proposals#1502
KP.2.3  Alias of B.1.1.529.2.86.1.1.11.1.2.3, S:H146Q, ORF3a:K67N, S:31del, from sars-cov-2-variants/lineage-proposals#1459
KP.2.15 Alias of B.1.1.529.2.86.1.1.11.1.2.15, A10861G, S:31del, USA/Canada
KP.3.1.1        Alias of B.1.1.529.2.86.1.1.11.1.3.1.1, ORF1a:S4286C, C12616T, S:S31del, Spain, from sars-cov-2-variants/lineage-proposals#1563
KP.4.1.3        Alias of B.1.1.529.2.86.1.1.11.1.4.1.3, ORF1a:M598V, A5245G, S:31del 
LF.2    Alias of B.1.1.529.2.86.1.1.16.1.2, ORF1a:K247R, ORF3a:Y184H, many with S:S31del, from sars-cov-2-variants/lineage-proposals#1502
LF.4.1  Alias of B.1.1.529.2.86.1.1.16.1.4.1, S:31del, ORF1a:G519S, ORF8:Q29*, from sars-cov-2-variants/lineage-proposals#1590
MA.1    Alias of B.1.1.529.2.86.1.1.18.3.1, S:R190S, S:31del, from sars-cov-2-variants/lineage-proposals#1635
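
The sibling relationship noted above can be checked mechanically. Below is a minimal Python sketch, assuming the `lineage_notes.txt` line layout shown in the grep output. The helper names (`parse_alias`, `sibling_pairs`) and the six-line `SAMPLE` excerpt are illustrative only, and the check finds direct siblings (same unaliased parent), not deeper ancestry.

```python
import re

# Excerpt of lineage_notes.txt lines copied from the grep output above.
SAMPLE = [
    "KP.2.14 Alias of B.1.1.529.2.86.1.1.11.1.2.14, S:G184V, S:S31F, from sars-cov-2-variants/lineage-proposals#1578",
    "KP.3.1.3        Alias of B.1.1.529.2.86.1.1.11.1.3.1.3, S:S31F,from sars-cov-2-variants/lineage-proposals#1576",
    "JN.1.20 Alias of B.1.1.529.2.86.1.1.20, S:S31F, directly on JN.1 polytomy",
    "KP.2.15 Alias of B.1.1.529.2.86.1.1.11.1.2.15, A10861G, S:31del, USA/Canada",
    "KP.3.1.1        Alias of B.1.1.529.2.86.1.1.11.1.3.1.1, ORF1a:S4286C, C12616T, S:S31del, Spain, from sars-cov-2-variants/lineage-proposals#1563",
    "MA.1    Alias of B.1.1.529.2.86.1.1.18.3.1, S:R190S, S:31del, from sars-cov-2-variants/lineage-proposals#1635",
]


def parse_alias(line):
    """Return (alias, unaliased_name) for a notes line, or None if it does not match."""
    match = re.match(r"(\S+)\s+Alias of ([A-Z0-9.]+),", line)
    return (match.group(1), match.group(2)) if match else None


def parent(unaliased_name):
    """Drop the last dotted component to get the parent lineage."""
    return unaliased_name.rsplit(".", 1)[0]


def sibling_pairs(lines):
    """Pair each S:S31F lineage with every 31del lineage sharing its parent."""
    f_lineages, del_lineages = [], []
    for line in lines:
        parsed = parse_alias(line)
        if parsed is None:
            continue
        if re.search(r"S:.?31F", line):
            f_lineages.append(parsed)
        if "31del" in line:
            del_lineages.append(parsed)
    return [
        (f_alias, d_alias)
        for f_alias, f_full in f_lineages
        for d_alias, d_full in del_lineages
        if parent(f_full) == parent(d_full)
    ]


pairs = sibling_pairs(SAMPLE)
```

On this excerpt the sketch pairs KP.2.14 with KP.2.15 and KP.3.1.3 with KP.3.1.1, matching the two exceptions mentioned above.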

@corneliusroemer how have you been distinguishing between S:31del and S:S31F?

corneliusroemer commented 1 month ago

I think OP suggests masking 21653-21655 only where we know that these are deleted. All S:31- branches that have been designated have at least one nuc substitution that defines them, because Usher is blind to deletions. The nuc substitutions are how you're annotating these lineages, I think, @AngieHinrichs. Is that correct?

S:S31F does show up independently and it's not usually a sequencing artefact, but I think it can happen that S:S31F branches get wrongly placed into S:S31- branches, something that shouldn't happen except through recombination (or, very unlikely, an exact insertion).

@aviczhl2 is right that Denmark struggles with indels. So in this case it is very likely an artefact that should be removed.

Whether to mask generally or not - I'm not sure. What we should really do is mask the position of this deletion in Danish sequences as the Danish pipeline seems to frequently call the deletion as S:S31F instead.

@AngieHinrichs re your question how I find the S:31- lineages: I usually query for that deletion in GISAID/covSpectrum and place those sequences with S:31del in Usher. That way I essentially do a manual ancestral inference of the state at S:31 to find the node where the deletion likely started to appear. Of course it's not perfect since Usher is blind to deletions, but it seems to work pretty well.

If you want to confirm that the designations are correct, you could create a simple TSV and drop it onto an Auspice view of an Usher subtree (Auspice can add additional metadata colorings from drag&dropped tsv/csv)

CSV would look like this for example:

strain_name, S31_genotype
Denmark/DCGC-686892/2024|OZ120425.1|2024-06-24, F
...

If you drop this, it would give you a new "coloring" called "S31_genotype" which one could use to find the branch on which the deletion appears to have started to happen.
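
A file like that could be generated with a few lines of Python. This is only a sketch: it assumes each sequence is already aligned to the reference so that the S:31 codon (21653-21655) can be sliced out, and the genotype labels other than `F` (`del`, `S`, `unknown`, `other`) as well as the function names are made up for illustration. The `strain_name` header simply follows the example above.

```python
import csv
import io


def s31_genotype(codon):
    """Classify the aligned S:31 codon (reference positions 21653-21655)."""
    codon = codon.upper()
    if codon == "---":
        return "del"
    if "N" in codon or "-" in codon:
        return "unknown"  # no call, or a partial gap
    if codon in ("TTT", "TTC"):
        return "F"
    if codon.startswith("TC"):
        return "S"  # all four TCN codons encode serine
    return "other"


def write_coloring_csv(rows, handle, column="S31_genotype"):
    """Write (strain, codon) pairs as a two-column CSV for drag and drop."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["strain_name", column])
    for strain, codon in rows:
        writer.writerow([strain, s31_genotype(codon)])


rows = [("Denmark/DCGC-686892/2024|OZ120425.1|2024-06-24", "TTT")]
buffer = io.StringIO()
write_coloring_csv(rows, buffer)
csv_text = buffer.getvalue()
```

Writing through the `csv` module avoids the stray spaces after commas that hand-typed files tend to have.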

Does that make sense?

corneliusroemer commented 1 month ago

By the way, I think that the reason that S:S31F is sometimes called instead of deletion is that the difference between S:S31F and deletion is the length of a stretch of T homopolymers:

[Image: screenshot]

Essentially, the difference between S:S31- and S:S31F is just whether the stretch of Ts is of length 3 or 6. So I can see how some pipelines might get that wrong, especially if the pipeline is not very pegged to a reference but more de-novo like, which is good to avoid bias to reference, but in this case causes a different type of artefact.
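
The homopolymer argument can be reproduced on a short stretch of sequence. In the Python sketch below, the 12-base string for codons S:30 to S:33 (reference positions 21650-21661) is written from memory of the Wuhan-Hu-1 reference and should be treated as an assumption to verify; the variable and function names are illustrative.

```python
import re

# Assumed reference bases for codons S:30-S:33 (N, S, F, T), positions 21650-21661.
REF = "AATTCTTTCACA"


def longest_t_run(seq):
    """Length of the longest uninterrupted run of T in seq."""
    return max((len(run) for run in re.findall(r"T+", seq)), default=0)


# C21654T (S:S31F): substitute the middle base of codon 31 (index 4 in REF).
s31f = REF[:4] + "T" + REF[5:]

# S:31del: remove codon 31 entirely (indices 3-5 in REF, positions 21653-21655).
s31del = REF[:3] + REF[6:]

run_f = longest_t_run(s31f)
run_del = longest_t_run(s31del)
```

Under that assumed reference stretch, the S:S31F sequence carries a run of six Ts and the deletion a run of three, which is the 3-versus-6 difference described above.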

corneliusroemer commented 1 month ago

> If all of those are artefacts (?), then the lineages JN.1.20, JN.1.58.3, JN.2.1, KP.3.1.3, MK.1, and JN.1.1.7 (on big C11747T polytomy) need to be retracted because they have no other defining mutation.

I don't think they are all artefacts. It's possible but unlikely, because it seems to be only the Danish sequences that miscall the deletion as F. Whenever there is a natural country distribution and multiple labs, an artefact is unlikely. If one of them did turn out to be an artefact, we should of course retract the lineage, but I haven't seen any convincing evidence to that end (though I also haven't looked at those again since designating).

aviczhl2 commented 1 month ago

> S:31del (or S:S31del) is listed for several JN.1 descendant lineages, and I see that includes KP.3.1.1, but as far as I can see they don't seem to be ancestors or siblings of the lineages with S:S31F above (except for KP.2.14 above & KP.2.15 below, and KP.3.1.3 above & KP.3.1.1 below).
>
> grep 31del ~/github/pango-designation/lineage_notes.txt
>
> KP.1.1.3        Alias of B.1.1.529.2.86.1.1.11.1.1.1.3, C4999T, many with S:S31del, from sars-cov-2-variants/lineage-proposals#1502
> KP.2.3  Alias of B.1.1.529.2.86.1.1.11.1.2.3, S:H146Q, ORF3a:K67N, S:31del, from sars-cov-2-variants/lineage-proposals#1459
> KP.2.15 Alias of B.1.1.529.2.86.1.1.11.1.2.15, A10861G, S:31del, USA/Canada
> KP.3.1.1        Alias of B.1.1.529.2.86.1.1.11.1.3.1.1, ORF1a:S4286C, C12616T, S:S31del, Spain, from sars-cov-2-variants/lineage-proposals#1563
> KP.4.1.3        Alias of B.1.1.529.2.86.1.1.11.1.4.1.3, ORF1a:M598V, A5245G, S:31del
> LF.2    Alias of B.1.1.529.2.86.1.1.16.1.2, ORF1a:K247R, ORF3a:Y184H, many with S:S31del, from sars-cov-2-variants/lineage-proposals#1502
> LF.4.1  Alias of B.1.1.529.2.86.1.1.16.1.4.1, S:31del, ORF1a:G519S, ORF8:Q29*, from sars-cov-2-variants/lineage-proposals#1590
> MA.1    Alias of B.1.1.529.2.86.1.1.18.3.1, S:R190S, S:31del, from sars-cov-2-variants/lineage-proposals#1635

I'm not suggesting masking S31F for everything on JN.1. In fact it is one of the beneficial convergent mutations that appear many times in the real world. I'm only suggesting masking S:S31F for lineages that are defined by S:S31-, as S31F calls on these lineages are clearly artefacts.

These designated S31- lineages are (for now):

C28714T branch of KP.2.3
KP.2.15
LB.1 except for LB.1.8
KP.3.1.1
KP.1.1.3
KP.4.1.3
MA.1
LF.2, LF.4.1 and LF.1.1.1
XDY

Please mask 21653-21655 for seqs on these lineages (or alter the Danish 31F seqs belonging to these lineages to 31-).
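
At the sequence level, the requested masking amounts to overwriting three reference-aligned positions with N for samples on the listed lineages. The Python sketch below illustrates only that idea and is not how Usher applies masks internally. The function names are hypothetical, and the descendant check works on dotted names only, so aliased descendants would have to be unaliased first.

```python
MASK_START, MASK_END = 21653, 21655  # 1-based, inclusive: the S:31 codon


def on_lineage(lineage, targets):
    """True if lineage equals a target or is a dotted descendant of one."""
    return any(lineage == t or lineage.startswith(t + ".") for t in targets)


def mask_s31(aligned_seq, lineage, targets, start=MASK_START, end=MASK_END):
    """Replace positions start..end with N when the sample is on a target lineage.

    aligned_seq must be in reference coordinates (same length as the reference).
    """
    if not on_lineage(lineage, targets):
        return aligned_seq
    return aligned_seq[: start - 1] + "N" * (end - start + 1) + aligned_seq[end:]


# Toy input: a placeholder sequence long enough to cover the masked positions.
targets = {"KP.3.1.1", "KP.2.15"}
toy = "A" * 21660
masked = mask_s31(toy, "KP.3.1.1", targets)
untouched = mask_s31(toy, "JN.1.20", targets)
```

Samples on lineages outside the list, such as a genuine S:S31F lineage, are returned unchanged, so real S31F signal elsewhere on JN.1 is preserved.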

aviczhl2 commented 4 weeks ago

@AngieHinrichs It is causing more and more trouble now.

For example, almost every lineage on KP.3.1.1 now has a "back-up branch" on the S31F artefact branch.

FedeGueli commented 4 weeks ago

I think the queries don't miss any of them; the only real issue will be if a fast lineage emerges in Denmark that extensively misses 31del.

aviczhl2 commented 4 weeks ago

> I think the queries don't miss any of them; the only real issue will be if a fast lineage emerges in Denmark that extensively misses 31del.

Queries won't miss them, but the Usher tree will be very messy, with each lineage split across two different places.

aviczhl2 commented 4 weeks ago

@AngieHinrichs This bug is more harmful than normal artefacts.

Normal artefacts can only attract seqs without coverage at that position; seqs with correct coverage won't be affected unless a stable flip-flop reversion branch forms.

However, this bug can attract ALL seqs on the 31del lineages, because ALL of them have no coverage at S:31 (it is deleted), making the bug more serious in theory.

FedeGueli commented 4 weeks ago

To me the bug is not that serious: the tree can attract only S:S31F sequences, not all. And masking it could instead hide a real recombination event, given that a lot of lineages are expanding with 31P and 31F.

aviczhl2 commented 4 weeks ago

> To me the bug is not that serious: the tree can attract only S:S31F sequences, not all. And masking it could instead hide a real recombination event, given that a lot of lineages are expanding with 31P and 31F.

Nay. The tree can attract all 31del seqs, as they have no coverage at S:31. No coverage = can be placed anywhere. I believe Usher works this way, @AngieHinrichs. It does not only attract 31F seqs.

For example, #1881 is attracted despite not having S:S31F.

corneliusroemer commented 4 weeks ago

Agreed with @aviczhl2. I wonder though: why hasn't Usher simply inferred 31 to be F already for all of KP.3.1.1? That would be the parsimonious solution. The reason is that some KP.3.1.1 are wrongly called reference (instead of N or deletion), and that artefact is more common than the Danish one. Right?

I agree masking would make total sense due to the fact there'll be massive messiness that will only increase.

aviczhl2 commented 4 weeks ago

> Agreed with @aviczhl2. I wonder though: why hasn't Usher simply inferred 31 to be F already for all of KP.3.1.1? That would be the parsimonious solution. Reason is that some KP.3.1.1 are wrongly called reference (instead of N or deletion) - and that artefact is more common than the Danish one. Right?
>
> I agree masking would make total sense due to the fact there'll be massive messiness that will only increase.

Let me explain.

1: There are not many real S31F artefacts on S31del branches.
2: Usher will try to fill in mutations for seqs at codons with no coverage.
3: All 31del seqs have no coverage at S:31, as it is deleted.
4: 2+3 => Usher cannot handle deletions. It simply treats the seqs from 3 as having missing coverage at S:31 and tries to fill in mutations for them.
5: 3+4 => All seqs on KP.3.1.1 (and other 31del branches) can be placed either on 31F artefact branches or on normal branches that do not include any mutation at S:31.
6: 5 causes seqs to split across two positions, resulting in a messy tree.

@AngieHinrichs
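
The steps above can be illustrated with a toy placement cost in Python. This is a deliberately simplified model, not Usher's actual algorithm: each candidate branch is reduced to its base at two positions, a deleted or uncovered position in the sample is stored as `None`, and `placement_cost` (a hypothetical helper) counts only mismatches at positions where the sample has a call.

```python
def placement_cost(sample, branch_state):
    """Number of positions where the sample has a call that differs from the branch."""
    return sum(
        1
        for pos, base in branch_state.items()
        if sample.get(pos) not in (None, base)
    )


# Two candidate attachment points under KP.3.1.1 (toy states at two positions).
normal_branch = {12616: "T", 21654: "C"}    # no mutation recorded at S:31
artefact_branch = {12616: "T", 21654: "T"}  # carries the C21654T (S31F) artefact

# A true 31del sample: position 21654 is deleted, so it looks like missing data.
del_sample = {12616: "T", 21654: None}
# A sample with a (wrong) reference call at 21654.
ref_sample = {12616: "T", 21654: "C"}

cost_del_normal = placement_cost(del_sample, normal_branch)
cost_del_artefact = placement_cost(del_sample, artefact_branch)
cost_ref_artefact = placement_cost(ref_sample, artefact_branch)
```

In this toy model the 31del sample costs the same on both branches, so its placement is a tie, whereas the sample with a reference call at 21654 is penalised on the artefact branch.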

corneliusroemer commented 4 weeks ago

@aviczhl2 that's not enough to explain the screw-up. If it were always either deletion or F, Usher would infer everything to be F and there would be no messy tree.

The requirement for messiness is that both types of artefacts exist here: wild type and F, instead of the correct deletion.

As long as it's only deletion plus one other base, it's ok, it will infer the base. The messiness here comes from 2 base artefacts occurring.

Does that make sense?
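
This point can be checked with a small Fitch parsimony pass in Python over a toy tree. It is a sketch of textbook small parsimony, not of Usher's implementation: leaves hold the call at 21654, with `-` (deleted) treated like missing data that is compatible with any base, and the tree shapes are invented for illustration.

```python
ALL_BASES = frozenset("ACGT")


def fitch(tree):
    """Return (possible root states, minimum number of mutations) for a nested-tuple tree."""
    if isinstance(tree, str):
        # Deleted or uncalled leaves are compatible with every base.
        states = ALL_BASES if tree in ("N", "-") else frozenset(tree)
        return states, 0
    left_states, left_cost = fitch(tree[0])
    right_states, right_cost = fitch(tree[1])
    shared = left_states & right_states
    if shared:
        return shared, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


# Only deletion plus the F artefact (T at 21654): no mutation is needed.
only_f = (("-", "T"), ("-", ("-", "T")))
# Deletion plus both artefacts (T and reference C): a mutation is unavoidable.
mixed = (("-", "T"), ("C", ("-", "T")))

states_only_f, cost_only_f = fitch(only_f)
states_mixed, cost_mixed = fitch(mixed)
```

With only `-` and `T` leaves, the whole toy tree resolves to T with zero mutations. Adding a single `C` leaf forces one mutation somewhere, which is what splits the tree into a reference part and an S31F part.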

aviczhl2 commented 4 weeks ago

> @aviczhl2 that's not enough to explain the screw-up. If it were always either deletion or F, Usher would infer everything to be F and there would be no messy tree.
>
> The requirement for messiness is that both types of artefacts exist here: wild type and F, instead of the correct deletion.
>
> As long as it's only deletion plus one other base, it's ok, it will infer the base. The messiness here comes from 2 base artefacts occurring.
>
> Does that make sense?

Yeah, I think you're right. There are also the traditional base-filling artefacts that fill in S for 31del.