sars-cov-2-variants / lineage-proposals

Repository to propose and discuss lineages

XCM: XBB.2.3.13/DV.7.1 recombinant, with DV.7.1 Spike (8 seqs, Spain, Italy, Luxembourg) #674

Closed by corneliusroemer 1 year ago

corneliusroemer commented 1 year ago

This Spanish sequence looks like an XBB.2.3/DV.7.1 recombinant with a breakpoint between positions 22017 and 22031 (both ends inclusive).

Seems to be missing from Usher tree due to QC? @AngieHinrichs

hCoV-19/Spain/IB-HUSE-08555/2023|EPI_ISL_18106845|2023-08-08

Singlet when placed in Usher. It appears in a meaningless place (BA.4_dropout)

(EDITED): Thanks to @Sinickle who has found more samples of this recombinant (maybe he wants to add more on that).

I found a new query that catches all 5 sequences: , T13560C, T17661C, T25959C

Samples IDs: EPI_ISL_18106845, EPI_ISL_18118213, EPI_ISL_18215936, EPI_ISL_18218415, EPI_ISL_18220496

New Tree:

[Screenshot 2023-09-05 at 19:07:53]

https://nextstrain.org/fetch/genome.ucsc.edu/trash/ct/subtreeAuspice1_genome_3491d_75b310.json?label=id:node_10802729

AngieHinrichs commented 1 year ago

Seems to be missing from Usher tree due to QC? @AngieHinrichs

Yep, it's filtered out for having too many reversions (6 > 5). I will add an exception for it and it should be added in the 2023-08-25 build.

AngieHinrichs commented 1 year ago

If there are others like it (with >5 reversions relative to nextclade placement), they will be filtered out too until I add exceptions for them, so please let me know if more appear.
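As a rough illustration of the filter described here, a reversion-count check with an exception list might look like the sketch below. This is not the actual UShER pipeline code: the function name, the input layout, and the second sample ID are hypothetical. Only the threshold of 5 and the "6 > 5" case come from the comments above.

```python
MAX_REVERSIONS = 5  # threshold quoted in the thread: more than 5 reversions is filtered


def filter_by_reversions(reversion_counts, exceptions=frozenset(),
                         max_reversions=MAX_REVERSIONS):
    """Split sequence IDs into (kept, dropped).

    reversion_counts: dict of sequence ID -> number of reversions relative
        to its nextclade placement (hypothetical input layout).
    exceptions: IDs that are always kept, mimicking manually added exceptions.
    """
    kept, dropped = [], []
    for seq_id, n_reversions in reversion_counts.items():
        if n_reversions > max_reversions and seq_id not in exceptions:
            dropped.append(seq_id)
        else:
            kept.append(seq_id)
    return kept, dropped


# EPI_ISL_18106845 had 6 reversions; "hypothetical_sample" is made up.
counts = {"EPI_ISL_18106845": 6, "hypothetical_sample": 2}
print(filter_by_reversions(counts))
print(filter_by_reversions(counts, exceptions={"EPI_ISL_18106845"}))
```

Without an exception the recombinant is dropped; once its ID is listed as an exception it is kept regardless of its reversion count.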

aviczhl2 commented 1 year ago

+1 Italy EPI_ISL_18215936 @AngieHinrichs

FedeGueli commented 1 year ago

Possible query: C25721T, G331T, T13560C, T17661C. It finds 3 sequences: EPI_ISL_18106845, EPI_ISL_18118213, EPI_ISL_18215936. But looking a bit deeper into it, I am not sure they are all the same recombinant. Look at ORF1ab:

[Screenshot 2023-09-05 at 14:30:09] Top is Cornelius's sequence, middle is Fede's, bottom is @aviczhl2's.

(I have edited the proposal above with this.)

aviczhl2 commented 1 year ago

EPI_ISL_18118213 seems to have a terrible Spike?

FedeGueli commented 1 year ago

EPI_ISL_18118213 seems to have a terrible Spike?

Yeah, all frameshifted.

FedeGueli commented 1 year ago

@Sinickle found more sequences belonging to this recombinant (maybe he wants to add more on that). Thanks to his tree: https://nextstrain.org/fetch/genome.ucsc.edu/trash/ct/subtreeAuspice1_genome_25eac_742490.json?c=userOrOld&label=id:node_7046697

I found a new query that catches all 5 sequences: , T13560C, T17661C, T25959C

Samples IDs: EPI_ISL_18106845, EPI_ISL_18118213, EPI_ISL_18215936, EPI_ISL_18218415, EPI_ISL_18220496

(added this comment to the main proposal)

FedeGueli commented 1 year ago

4 out of 5 sequences seem to have Orf1a.T1822I (C5730T) = NSP3_T1004I = PLPro_T259I

PLPro_T259I seems to play an important role according to this paper by Ferreira et al. [Screenshot 2023-09-05 at 18:52:40]

AngieHinrichs commented 1 year ago

Samples IDs: EPI_ISL_18106845, EPI_ISL_18118213, EPI_ISL_18215936, EPI_ISL_18218415, EPI_ISL_18220496

Thanks! I added exceptions but today's build is already underway so they should go in tomorrow's build, 2023-09-06.

Sinickle commented 1 year ago

I don't have too much more to add --

I feel like this could quite plausibly be a more fit form of DV.7.1 -- S:F157L is a fitness improving mutation on XBB (ie, convergent and measured to improve growth rate advantage), so the breakpoint being between S:152 and S:157 could reasonably be one of the more advantageous possibilities.

DV.7.1 still looks pretty strong in Asturias, Spain, which I believe has had its earliest and strongest presence. With this proposed recombinant at 5 sequences, all of them recent and from 3 countries, I personally feel that this is one of the top variants to keep a watch on and that it should be designated.

aviczhl2 commented 1 year ago

Will further seqs continue to be removed if we don't designate?

The 5-reversion rule starts counting from a designated lineage, right?

corneliusroemer commented 1 year ago

Just noticed it again when looking at uploads from today. Thanks for the discussion here, everyone!

Designated as XCM

corneliusroemer commented 1 year ago

We can actually narrow down the XBB.2.3 donor pretty well! All 5 private mutations on that side up to breakpoint on top of XBB.2.3 occurred in a few sequences from Spain/France!

This is the donor, XBB.2.3.13:

image image

The donor's presence in Spain is consistent with the fact that the other donor, DV.7.1, is most common in Spain, and that the first sequence is from Spain.

FedeGueli commented 1 year ago

Great catch! I didn't check. It went from singlet to lineage quite fast!

corneliusroemer commented 1 year ago

The other side can be narrowed down as well to this branch of DV.7.1:

image image

https://next.nextstrain.org/fetch/genome.ucsc.edu/trash/ct/subtreeAuspice1_genome_e590_789090.json?label=id:node_7691732

FedeGueli commented 1 year ago

The other side can be narrowed down as well to this branch of DV.7.1: https://next.nextstrain.org/fetch/genome.ucsc.edu/trash/ct/subtreeAuspice1_genome_e590_789090.json?label=id:node_7691732

I think this parental DV.7.1 branch was transferred to the main page today.

AngieHinrichs commented 1 year ago

The 5-reversion rule starts counting from a designated lineage, right?

I use command-line nextclade's very detailed output for filters, so it starts from Omicron lineages that are recognized by nextclade. Nextclade is pretty much always more up to date than pangolin and often includes lineages that have not yet been in a pango-designation release (unlike pangolin-data), but it may not necessarily have all of the most recently designated lineages (especially because I need to remember to update nextclade data when there is a release).

Over-There-Is commented 1 year ago

The other side can be narrowed down as well to this branch of DV.7.1: https://next.nextstrain.org/fetch/genome.ucsc.edu/trash/ct/subtreeAuspice1_genome_e590_789090.json?label=id:node_7691732

I think this parental DV.7.1 branch was transferred to the main page today.

Yes, it is cov-lineages/pango-designation#2258

Sinickle commented 1 year ago

A sequence in Finland yesterday and another in Spain today, meaning that the last 6 sequences (there's only 7 total) were uploaded in the last 5 days, in 5 different countries.

EDIT: A sequence in Finland yesterday and another in Spain and Italy today, meaning that the last 7 sequences (there's only 8 total) were uploaded in the last 5 days, in 5 different countries.

FedeGueli commented 1 year ago

A sequence in Finland yesterday and another in Spain today, meaning that the last 6 sequences (there's only 7 total) were uploaded in the last 5 days, in 5 different countries.

Definitely one to watch!

JosetteSchoenma commented 12 months ago

A new one from the Netherlands: EPI_ISL_18263413. It is missing T25959C, though. And one from Austria: EPI_ISL_18262219. I used this query: C25721T, T13560C, T17661C.
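For readers unfamiliar with these mutation queries: a query is a comma-separated list of nucleotide substitutions, and a sequence matches only if it carries all of them. A minimal sketch of that logic follows. The function names are made up, and the sample's mutation set is a toy subset chosen to mirror the Dutch sequence described above, not its real mutation profile.

```python
def parse_query(query):
    """Turn 'C25721T, T13560C' into a set, ignoring empty items and spaces."""
    return {item.strip() for item in query.split(",") if item.strip()}


def matches(query, sample_mutations):
    """True if the sample carries every mutation named in the query."""
    return parse_query(query) <= set(sample_mutations)


# Toy profile: has the three query mutations but lacks T25959C.
toy_sample = {"C25721T", "T13560C", "T17661C"}

print(matches("C25721T, T13560C, T17661C", toy_sample))           # True
print(matches("C25721T, T13560C, T17661C, T25959C", toy_sample))  # False
```

Dropping T25959C from the query is what lets a sequence lacking that mutation be found, at the cost of a less specific search.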

aviczhl2 commented 12 months ago

It seems that additional XCM seqs are still removed even after designation.

AngieHinrichs commented 12 months ago

It seems that additional XCM seqs are still removed even after designation.

Yep. I will need to manually include sequences until the following things happen:

  1. nextclade data for SARS-CoV-2 is updated to include XCM
  2. I update nextclade (this will make sure new sequences aren't excluded)
  3. I re-run nextclade on older sequences

So in the meantime if you could periodically send a list of names or IDs that would be super helpful! I see EPI_ISL_18263413 and EPI_ISL_18262219 in Josette's note above and will add those now. They should appear in the 2023-09-20 tree.

aviczhl2 commented 12 months ago

I see manual labels like BA.4-dropout or XBB.1.5_17124 on the tree. Would the problem be solved if you added a manual label for XCM?

AngieHinrichs commented 11 months ago

nextclade is the source of the number of reversions used for filtering, and the nextclade tree is built independently of the UShER tree. Those manual labels are in the UShER tree, not in the nextclade tree. They are for the benefit of Pango, i.e. when I make a minimized tree for use in pangolin, those labels help to identify the correct lineages of sequences that belong to a lineage but may have an odd placement in the UShER tree.

aviczhl2 commented 11 months ago

@AngieHinrichs The UShER placement of XCM still seems buggy: most seqs are misread as having the XBB.1 mutations Q183E, G252V, L368I, V445P, F490S. As a result, real XCM seqs (the 2 Denmark ones) appear to have 6 "reversions".

AngieHinrichs commented 11 months ago

Yes, sorry, the accuracy of the tree's mutations for recombinants suffers because I apply branch-specific masking of common artifact reversions, so true reversions and mutations lost by recombination are ignored in the tree and it appears that they have mutations that they don't.

You can see the long list of positions (or specific reversions) that I mask out in the file branchSpecificMask.yml, for example in XBB (starting on line 137) I mask out the reversions G22109C (S:183), A22664C (S:368), C22895G and C22896T (S:445), and C23031T (S:490). In XBB.1 (> G27915T) I mask out the reversion of the XBB.1-defining T22317G (S:252).
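The nucleotide-to-Spike correspondences quoted above can be checked with a small sketch. It assumes Wuhan-Hu-1 reference coordinates (NC_045512.2), where the S gene spans 21563..25384; the function name is made up for illustration.

```python
SPIKE_START = 21563  # first base of the S gene in NC_045512.2 (1-based)
SPIKE_END = 25384    # last base of the S gene, including the stop codon


def spike_codon(nuc_pos):
    """Map a 1-based genome position to its 1-based Spike codon number."""
    if not SPIKE_START <= nuc_pos <= SPIKE_END:
        raise ValueError(f"position {nuc_pos} is outside the S gene")
    return (nuc_pos - SPIKE_START) // 3 + 1


# Positions named in the comment above and the codons they fall in.
for pos in (22109, 22664, 22895, 22896, 23031, 22317):
    print(pos, "-> S:%d" % spike_codon(pos))
```

The same arithmetic places the proposed breakpoint window, 22017 to 22031, between S:152 and S:157.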

It's a trade-off -- if I don't mask reversions, then they can create a lot of false branching in the tree, so I mask when I see trouble. But when I mask them, the tree has incorrect mutations for true reversions at those positions and for many recombinants.

FedeGueli commented 11 months ago

XCM

JosetteSchoenma commented 10 months ago

I am a bit lost on what is going on with the XCM Usher tree. Newly uploaded ones are on a branch with Spike reversions, while those already in the tree are shown with mutations that they do not actually have, like S:F490S. See EPI_ISL_18403533, for example, but also several from GenBank without an EPI_ISL number. @AngieHinrichs

These are the 62 I get with my query, including 9 new Dutch ones today:

EPI_ISL_18106845 EPI_ISL_18118213 EPI_ISL_18215936 EPI_ISL_18218415 EPI_ISL_18220496 EPI_ISL_18228332 EPI_ISL_18234105 EPI_ISL_18234460 EPI_ISL_18262219 EPI_ISL_18263413 EPI_ISL_18273371 EPI_ISL_18290110 EPI_ISL_18290697 EPI_ISL_18295453 EPI_ISL_18313581 EPI_ISL_18313714 EPI_ISL_18315600 EPI_ISL_18329964 EPI_ISL_18329980 EPI_ISL_18331311 EPI_ISL_18331338 EPI_ISL_18331345 EPI_ISL_18331346 EPI_ISL_18331735 EPI_ISL_18338530 EPI_ISL_18356463 EPI_ISL_18362762 EPI_ISL_18363960 EPI_ISL_18366537 EPI_ISL_18367143 EPI_ISL_18367151 EPI_ISL_18370443 EPI_ISL_18375963 EPI_ISL_18377320 EPI_ISL_18378282 EPI_ISL_18378287 EPI_ISL_18378294 EPI_ISL_18378311 EPI_ISL_18378825 EPI_ISL_18383761 EPI_ISL_18386740 EPI_ISL_18398287 EPI_ISL_18398491 EPI_ISL_18400049 EPI_ISL_18403514 EPI_ISL_18403521 EPI_ISL_18403533 EPI_ISL_18406416 EPI_ISL_18406423 EPI_ISL_18406640 EPI_ISL_18406656 EPI_ISL_18407796 EPI_ISL_18413802 EPI_ISL_18419891 EPI_ISL_18419892 EPI_ISL_18419893 EPI_ISL_18419899 EPI_ISL_18419904 EPI_ISL_18419994 EPI_ISL_18420013 EPI_ISL_18420034 EPI_ISL_18420052

AngieHinrichs commented 10 months ago

I am a bit lost on what is going on with the XCM Usher tree. Newly uploaded ones are on a branch with Spike reversions, while they are already there with mutations showing that they do not actually have, like S:F490S.

That is an unfortunate side effect of branch-specific masking, sorry. XBB has T23031C (S:F490S), but many sequences in XBB have erroneous reversions on 23031 (and many other Spike mutations). So on the XBB branch, I mask out all C23031T reversions (and many other specific reversions, see branchSpecificMask.yml). Unfortunately when there are true reversions, or recombinants that don't have specific mutations because they have a different parental lineage at that position vs. the lineage where UShER placed the sequence, the UShER tree falsely reports the mutations. It's a tradeoff -- in order to prevent a lot of bad branches that would be caused by sequences with false reversions, we allow some false mutations in recombinants (and rare true reversions).

JosetteSchoenma commented 10 months ago

OK. Thanks for explaining. Is this also why there are samples on the XCM tree that are probably not XCM, since they do have an F490S mutation?

AngieHinrichs commented 10 months ago

Is this also why there are samples on the XCM tree, while they probably are not XCM, since they do have a F490S mutation?

Hmmm. It is a little suspicious if that is the only way in which they differ. How sure are we about the samples that do vs. don't have that mutation -- are only consensus genome sequences available (like in GISAID or GenBank), or are any of them in SRA (with raw reads, viewable using Theo Sanderson's https://deeperseq.genomium.org/) I wonder?

JosetteSchoenma commented 10 months ago

My mistake. They probably do not actually have F490S; Usher is only showing it. I read your comment too quickly this morning and forgot about my own. So, if I understand this correctly now, Usher adds F490S because it assumes its absence was an error. And in freshly uploaded ones, Usher has not added it yet, so it places the sample in the tree with a 490 reversion (and other reversions), while in fact no XCM has F490S.

AngieHinrichs commented 10 months ago

It's a little complicated but here's how it works:

  1. UShER places a new sequence. It may or may not have S:F490S. If it does not have S:F490S, but is placed in XBB, then yes it will have a reversion after placement. But...
  2. After running UShER to add new sequences, branch-specific masking is applied. Any sequences that have been placed in XBB with a reversion on S:F490S will have that reversion erased. (If the reversion was a false reversion as I believe is true in most cases, that's a good thing; if it's a real reversion then whoops, I've made it incorrect.)

So yes, a sequence without S:F490S is temporarily placed with a reversion -- but you never see that intermediate tree, because I do branch-specific masking (and then optimization to see if we can move branches around to make the tree even better after all new sequences have been placed, and then filtering to remove implausibly long branches) before updating the tree in usher.bio.
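The effect of the two steps above can be mimicked with a toy model. This is only a sketch of the idea, not how UShER or branchSpecificMask.py is implemented: the function names and the flat set-of-mutations representation are assumptions made for illustration.

```python
def reverse_of(mutation):
    """'C23031T' -> 'T23031C' (swap the reference and alternate bases)."""
    return mutation[-1] + mutation[1:-1] + mutation[0]


def displayed_mutations(branch_mutations, sample_changes,
                        masked_reversions=frozenset()):
    """Mutations the tree would show for a sample placed on a branch.

    branch_mutations: mutations inherited from the branch (e.g. XBB).
    sample_changes: changes on the sample's own branch after placement.
    masked_reversions: reversions erased by branch-specific masking.
    """
    shown = set(branch_mutations)
    for change in sample_changes:
        if change in masked_reversions:
            continue  # erased, so the inherited mutation stays visible
        if reverse_of(change) in shown:
            shown.discard(reverse_of(change))  # a kept reversion cancels it
        else:
            shown.add(change)
    return shown


xbb = {"T23031C"}        # S:F490S on the XBB branch
recombinant = {"C23031T"}  # placed in XBB but lacking S:F490S

print(displayed_mutations(xbb, recombinant))               # set()
print(displayed_mutations(xbb, recombinant, {"C23031T"}))  # {'T23031C'}
```

With masking on, the reversion is dropped and the sample appears to carry T23031C (S:F490S) even though its genome does not, which is the behaviour discussed for XCM.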

In order to tell if a genome really does or doesn't have a mutation, it's best to look at the raw reads. Then one can get a better idea of whether there is enough read coverage at that position to be confident about the base value, and if there are enough reads, then what base value(s) they have at that position. It can still be pretty ambiguous, possibly implying a mixed infection, contamination, or other sequencing issues.

Bottom line: especially for recombinants, don't 100% trust the UShER tree to tell what mutations a sequence has or does not have -- there is a lot of masking and imputation of missing data. In order to be sure, it's necessary to look at the genome sequence. If the sequence has a GenBank accession, Theo's https://gensplore.theo.io/ is useful. If raw data for the sequence has been submitted to SRA, then Theo's https://deeperseq.genomium.org/ can show the full detail. If you're working with sequences downloaded from GISAID then nextclade is the way to go (or the GISAID details page's list of mutations for each sequence).

aviczhl2 commented 10 months ago

Maybe we should do branch-specific unmasking of these positions for XCM (and XCT and future XBB/DV.7.1 recombinants)?

JosetteSchoenma commented 10 months ago

@AngieHinrichs Since you say: 'So yes, a sequence without S:F490S is temporarily placed with a reversion -- but you never see that intermediate tree, because I do branch-specific masking (and then optimization to see if we can move branches around to make the tree even better after all new sequences have been placed, and then filtering to remove implausibly long branches) before updating the tree in usher.bio.'

Does that mean that you did not do the branch-specific masking yet for XCM? Because there are many XCM with F490S on that tree and if I do a GISAID search for XCM with the query and then add F490S to the query afterwards, there are 0 samples left. So XCM really does not have F490S. And I guess that is the same for the other Spike mutations that show as reverted.

I think this is what @aviczhl2 means as well.

AngieHinrichs commented 10 months ago

Maybe we should do branch-specific unmasking of these positions for XCM (and XCT and future XBB/DV.7.1 recombinants)?

In the current implementation of branch-specific masking, there is not a way to exclude descendant branches from the masking. For example, if a reversion is masked in the XBB branch, then it is erased from any sequence that is placed anywhere within the XBB branch, even if it belongs to some other lineage where we don't want it erased.

If there were already a mechanism to exclude some descendants from branch-specific masking, then I already would do that for recombinants (and those rare true-reversion lineages). But it would require some new development and configuration work to add that mechanism to matUtils and branchSpecificMask.{py,yml}. Also, currently affected branches would have to be removed from the tree so their sequences could be added back again with the new mechanism in place. This is in the category of "doable but a significant new chunk of work for me" which is already a long list. The question is the magnitude of the damage done by the current implementation. If the appearance of known false mutations in recombinants is disappointing to you, I am sorry about that. However, if the current drawbacks cause incorrect clustering in the tree, then I will look into fixing it.

In this particular case I'm not convinced that there is a clustering error; if there are differences between sequences at S:490, then I would suspect sequencing issues unless there's good evidence that sequencing issues can't explain the differences.

AngieHinrichs commented 10 months ago

Does that mean that you did not do the branch-specific masking yet for XCM? Because there are many XCM with F490S on that tree and if I do a GISAID search for XCM with the query and then add F490S to the query afterwards, there are 0 samples left. So XCM really does not have F490S.

XCM sequences do not have F490S, but XCM in the tree falsely appears to have F490S -- that is caused by branch-specific masking in the XBB branch where XCM is placed. Branch-specific masking is applied every day after UShER places new sequences in the tree, and you are seeing its undesirable effect on XCM. Branch-specific masking prevents a lot of bad-branch structural problems in XBB as a whole, but it can cause mutation accuracy problems for recombinants and rare cases of true reversions.

JosetteSchoenma commented 10 months ago

Thanks for your elaborate answers, @AngieHinrichs. I certainly do not wish to add to your workload. I am simply trying to understand how this all works. I can indeed imagine your workload is more than big enough. 🙏

Would it maybe be a better idea to put XCM on the BA.2.75/DV.7.1 branch than on the XBB branch? Or would that simply cause the same problem for other mutations?

AngieHinrichs commented 10 months ago

Would it maybe be a better idea to put XCM on the BA.2.75/DV.7.1 branch than on the XBB branch?

Maybe it would, but usher (and matOptimize) will place a branch wherever it has the fewest additional mutations. So even if I could somehow force it to move to the other parent branch, matOptimize would probably move it back the next day.
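The "fewest additional mutations" criterion can be illustrated with a deliberately crude sketch. It treats each candidate node as a flat set of mutations and counts the differences from the sample; real UShER placement works on the full tree and is more involved. All names and the toy mutation sets below are hypothetical.

```python
def additional_mutations(node_mutations, sample_mutations):
    """Changes needed on a new branch from the node to the sample."""
    return len(set(node_mutations) ^ set(sample_mutations))


def best_placement(candidates, sample_mutations):
    """Name of the candidate node needing the fewest additional mutations.

    On ties, the first candidate in dict order wins.
    """
    return min(candidates,
               key=lambda name: additional_mutations(candidates[name],
                                                     sample_mutations))


# Made-up mutation sets standing in for two parental branches.
candidates = {
    "parent_A": {"A1T", "C2G", "G3A"},
    "parent_B": {"A1T"},
}
sample = {"A1T", "C2G", "G3A", "T4C"}

print(best_placement(candidates, sample))  # parent_A
```

Under this criterion, a recombinant forced onto the other parent would simply be scored again on the next run and moved back wherever the count is lowest.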

Or would that simply cause the same problem for other mutations?

Yes, that is also possible.