lenaschimmel / sc2rf

SARS-Cov-2 Recombinant Finder for fasta sequences
MIT License
48 stars 13 forks source link

Fix or remove 21I and 21J #10

Open lenaschimmel opened 2 years ago

lenaschimmel commented 2 years ago

The list of supported clades, which is written down in mappings.csv, was taken from nextclade-data, and contains three Delta clades: 21A, 21I and 21J. The latter two do not map to single pango lineages.

When I switched from nextclade to cov-spectrum to generate the contents of virus_properties.json, I just used mappings.csv to get the pango lineages and make requests to the cov-spectrum API. The pseudo-names "AY (higher)" and "AY (lower)" are not recognized, and the variant definitions in virus_properties.json remain empty. The consequence:

This tool claims to support 21I and 21J, but it currently does not.

To be honest, I never really got around to get an overview of the Delta/AY diversity, as I got into SARS-CoV-2 genomics when Delta was already declining and Omicron on the rise. Thus, I have no clear plan on how to solve this. I think, once I get my Delta-knowledge up to date, the technical solution might be quite easy.

AngieHinrichs commented 2 years ago

Ah, I missed your explanation in #8 while I was writing in #4! Makes sense, and hopefully the mutation frequencies are not a hard requirement in virus_properties.json because I can easily construct a virus_properties.json with just the mutations for 21I and 21J, but it would be more complicated to include allele frequencies.

[and accuracy of allele frequencies is subject to amplicon dropout issues in Delta and Omicron... working with SARS-CoV-2 sequences has been an education in how many ways there are for sequences to be not-quite-right in ways that confound attempts at phylogenetics...]

lenaschimmel commented 2 years ago

If you want to proceed before I implement a proper fix:

Within virus_properties.json, both mutation and proportion are mandatory. count is totally ignored and only in there because it comes like this from the cov-spectrum API.

So if you want to give your own mutations lists a try, you can put "proportion": 1.0 in there for testing purposes. It's used as a simple cut-off value (below or above --mutation-threshold), after that, the value has no influence. (It probably will have more influence in the mid-term future when I modify my algorithms.)

Or you modify mappings.csv and put in some actual pango lineages instead of AY (higher) and AY (lower) and then re-run it with --rebuild-examples.

AngieHinrichs commented 2 years ago

Here is a python snippet that fetches my nextstrain.clade-mutations.tsv from github and overwrites ./virus_properties.json with its mutations:

import re
import requests
import json

props = { "schemaVersion": "s2r 0.0.1",
          "comment": "File format is inspired by Nextstrains virus_properties, but evolved to be more like cov-spectrum API.",
          "nucMutLabelMap": {},
          "nucMutLabelMapReverse": {} }

r = requests.get("https://github.com/ucscGenomeBrowser/kent/raw/master/src/hg/utils/otto/sarscov2phylo/nextstrain.clade-mutations.tsv")
for line in r.text.split("\n"):
    if not line:
        continue
    clade, mutPath = line.split("\t")
    clade = clade.split(' ')[0]
    muts = filter(lambda x: x and not x.endswith('N'), re.split('[ ,>]', mutPath))
    mutProps = [ { "mutation": mut, "proportion": 1.0 } for mut in muts ]
    props["nucMutLabelMapReverse"][clade] = mutProps

with open('virus_properties.json', newline='', mode='w') as jsonfile:
    json.dump(props, jsonfile, indent=4)

--But it also requires 21I and 21J to be added to this line of search_recombinants.py [I tried several possible syntaxes for passing multiple clades with -c, but none of them worked] [ Edit: ha, because I didn't just try -c all]:

    parser.add_argument('--clades', '-c', nargs='*', default=['20I','20H','20J','21A', '21K','21L', '21BA3'], help='List of clades which are considered as potential parents. Use Nextclade names, i.e. "21A". Also accepts "all".')

Now it identifies 471.genbank.aligned.fa.gz as 21J / 21K.

BTW the AY. numbers for 21I and 21J are not cleanly divided into lower and higher lineage numbers, they're all interspersed, so I'm not really sure how to describe them in mapping.csv. Both 21I and 21J are `AY., they're just differentAY.*`. 😄

AngieHinrichs commented 2 years ago

Just for fun I also tried making virus_properties.json from pango.clade-mutations.tsv and adding a line for each AY. lineage to mapping.csv but it didn't work for this example, even with --unique 1 -- maybe there aren't enough unique mutations in the fine-grained AY. lineages for the approach to work??

It would be cool to have a hierarchical spec so that for example when a 21J match is found, next the mutations in the relevant AY.* lineages could be searched to see if any of them can give a more specific match. But in the meantime we can just search some private mutations on covSPECTRUM too. :)

chaoran-chen commented 2 years ago

Looking at CoVariants, my understanding is that you could, for example, define 20I as B.1.617.2* & S:A222V and 20J as B.1.617.2* & N:G215C. There are two different ways to query them in LAPIS; in this case, I believe that the simplest would be to use the variantQuery-parameter (this is what is used for the "advanced search" on CoV-Spectrum).

B.1.617.2* & S:A222V: https://lapis.cov-spectrum.org/open/v1/sample/nuc-mutations?variantQuery=B.1.617.2*+%26+S%3AA222V

B.1.617.2* & N:G215C: https://lapis.cov-spectrum.org/open/v1/sample/nuc-mutations?variantQuery=B.1.617.2*+%26+N%3AG215C

If you could map a Nextstrain clade to multiple AY lineages, you can also write queries like AY.1* | AY.2* or AY.* & !(AY.1* | AY.2*).

chaoran-chen commented 2 years ago

By the way, (rudimentary) support for Nextstrain clades already exists. Here are the Nextstrain clades that have B.1.617.2* as pango lineage:

https://lapis.cov-spectrum.org/open/v1/sample/aggregated?fields=nextstrainClade&pangoLineage=B.1.617.2*

And here are the nucleotide mutations of 21J: https://lapis.cov-spectrum.org/open/v1/sample/nuc-mutations?nextstrainClade=21J%20(Delta)

What I don't like yet is that you have to write "21J (Delta)" instead of just "21J", so there will be changes/improvements.

(@corneliusroemer, not sure if you actually know that.)

FedeGueli commented 2 years ago

@chaoran-chen that kind of search (B.1.617.2 & S:A222V) leads you to , for example, AY.4.2 too that is 21J or am i wrong? And i remember S:222V popping up here and there in multiple lineages so i do not think it could define well 21I

i ran that search for one day in november in belgium and it found both 21I and 21J lineages: Example

chaoran-chen commented 2 years ago

@FedeGueli, yes, you're right. I was just reading https://covariants.org/variants/21I.Delta and thought that S:A222V is specific to 21I. But as I now also see from the following link, that's not the case at all:

https://lapis.cov-spectrum.org/open/v1/sample/aggregated?variantQuery=B.1.617.2*+%26+S%3AA222V&fields=nextstrainClade

@lenaschimmel, maybe it's then the best if you directly filter for the Nextstrain clade? Here, you can also easily get a list of Nextstrain clades.

lenaschimmel commented 2 years ago

Thanks, everyone! That's a lot of useful input with multiple possible ways to solve this.

I really want to provide a quick fix for this problem within the next few hours. Having 21I and 21J broken is kind of embarrassing when the main use of this software right now might be working with Delta + Omicron recombinants.

At the moment, it seems to me that a 1-to-1-mapping of all nextstrain clades and pango lineages can never work, even though it works fine for most of the non-Delta variants (BA.3 is another problem). And I don't think that one naming convention is inherently better or worse than the other, so I'd like to support both, without any need for a total 1-to-1-mapping. This should be possible, but I'm not sure if I can do it today, hence my plan to do a quick fix before that.

Does that sound reasonable?

@AngieHinrichs wrote:

It would be cool to have a hierarchical spec so that for example when a 21J match is found, next the mutations in the relevant AY.* lineages could be searched to see if any of them can give a more specific match. But in the meantime we can just search some private mutations on covSPECTRUM too. :)

Yeah, I'd really like that! Given that this needs some major changes in the code, and this is just my hobby project, it should be doable in a timeframe of 1-3 weeks.

corneliusroemer commented 2 years ago

@chaoran-chen This is excellent: https://lapis.cov-spectrum.org/open/v1/sample/nuc-mutations?nextstrainClade=21J%20(Delta)

@lenaschimmel you can use this to find mutation proportions in each Delta clade!

The only slight drawback is that this uses only open data and not GISAID, so it's quite UK/US/Germany centric - but that's better than nothing.

I think for Delta it's much better to use 21I/J rather than B.1.617.2* since I/J are actually quite different beasts.

lenaschimmel commented 2 years ago

Already working on it. Currently it seems as if full support for both nextstrain clades and pango lineages is not too hard, so that I don't need to provide a quick fix before doing the real work.

I used the cov-spectrum page for Delta to get a list of all AY.* which make up at least 3% of all Deltas. I checked both GISAID and Open Data, results were almost identical.

Current mapping.csv looks like this:

Embedded CSV Table - if it does not work, please notify @lenaschimmel

You can now mix and match both naming schemes on the command line, i.e. --clades 21I 21J BA.1 BA.2.

Biggest remaining problem now is that lineages with large overlaps in their mutations cancel each other out, so if you select both a parent like 21A and multiple of their children like 21I and 21J, they might all end up with 0 unique mutations and not get used at all. So --clades all is now a very bad idea, I wonder if I should remove / disable that flag until I get rid of this "unique mutations" approach.

You can use --verboseto get an output like this. If there are very few unique mutations, the line will be highlighted in yellow (< 5) or red (< 3).

Clade  Alpha / B.1.1.7 / 20I has 28 mutations, of which 19 are unique.
Clade  Beta / B.1.351 / 20H has 20 mutations, of which 13 are unique.
Clade  Gamma / P.1 / 20J has 32 mutations, of which 22 are unique.
Clade  Delta / 21I has 27 mutations, of which 7 are unique.
Clade  Delta / 21J has 31 mutations, of which 0 are unique.
Clade  Delta / AY.4 has 34 mutations, of which 1 are unique.
Clade  Omicron / BA.1 / 21K has 51 mutations, of which 12 are unique.
Clade  Omicron / BA.2 / 21L has 66 mutations, of which 17 are unique.
Clade  Omicron / BA.3 has 45 mutations, of which 3 are unique.

For now, I tried to select a reasonable default for when you don't use the --clades arg at all: ['20I','20H','20J', '21I', '21J', 'BA.1', 'BA.2', 'BA.3'].