opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Variant type for small Multiple Nucleotide Variants required #809

Closed AsierGonzalez closed 3 years ago

AsierGonzalez commented 4 years ago

In the 19.11 and 20.02 UniProt evidence file submissions there are seven invalid evidence strings because their variant type is "snp multiple", which is not one of the values accepted by the JSON schema (valid options are "snp single", "snp snp interaction" and "structural variant"):

UniProt
"variant": {
    "id": "http://identifiers.org/dbsnp/rs281865416",
    "type": "snp multiple"
  }
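
For illustration, the check that rejects these evidence strings can be sketched as below. This is a minimal sketch and not the actual Open Targets validator: the accepted values are taken from the list above, while the function name `variant_type_is_valid` and the flat dictionary layout are assumptions made for the example.

```python
# Accepted values as listed above; the real JSON schema may express them differently.
ACCEPTED_VARIANT_TYPES = {"snp single", "snp snp interaction", "structural variant"}


def variant_type_is_valid(variant: dict) -> bool:
    """Return True if the variant's "type" is one of the accepted values."""
    return variant.get("type") in ACCEPTED_VARIANT_TYPES


# The UniProt variant object shown above.
uniprot_variant = {
    "id": "http://identifiers.org/dbsnp/rs281865416",
    "type": "snp multiple",
}
print(variant_type_is_valid(uniprot_variant))  # False
```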

According to Andrew Nightingale from UniProt, there was a discussion back in 2017 in which someone at OT told them that they could use "snp multiple" for this type of variant. From what I have seen, that change was never added to the OT JSON schema, and until 19.11 they never used that type. Does anyone know anything about this?

Interestingly, the same variant in the example above is also reported by EVA but as a "snp single":

EVA
"variant": {
    "id": "http://identifiers.org/dbsnp/rs281865416",
    "type": "snp single"
  }

According to dbSNP, this variant is a Multiple Nucleotide Variation (MNV), a delins following HGVS nomenclature given that it encompasses three nucleotide changes: SE [TCGGAG] > RG [AGGGGG]. Therefore, it seems that annotating it as "snp single" is inaccurate.
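
The three nucleotide changes can be confirmed by comparing the two alleles position by position. A minimal sketch, using only the two sequences quoted above (the variable names are just for illustration):

```python
ref = "TCGGAG"  # reference allele, encoding S and E
alt = "AGGGGG"  # alternate allele, encoding R and G

# 1-based positions at which the two equal-length alleles differ
mismatches = [i + 1 for i, (r, a) in enumerate(zip(ref, alt)) if r != a]
print(mismatches)  # [1, 2, 5]
```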

There are two related questions that need to be answered:

  1. Do we need to add a new variant type to the schema to annotate small MNVs? This is simple to do, but I don't know whether the variant type is used anywhere by the pipeline or webapp. At first glance I couldn't find it shown in the genetic evidence table. I have asked both UniProt and EVA for an estimate of the number of MNVs they may have in their databases, which is likely much higher than the 7 UniProt evidence strings annotated with that type. For example, there are 2643 variants referred to as "indels" in the 19.11 clinvar file (1657 unique rsids). This will give us a better idea of the need for a new variant type.

  2. If so, should the new variant type be called something like "snp multiple" or MNV as dbSNP do?

For context, the current use of different variant types in the evidence strings is as follows:

Variant type          Count
snp single            376945
snp snp interaction   0
structural variant    121
No variant info       8157412
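
A tally like the one above could be produced along these lines. This is a sketch under assumptions: it supposes one JSON evidence string per line with a top-level "variant" object shaped like the snippets earlier in this issue (the real evidence strings may nest it elsewhere), and `count_variant_types` is a hypothetical helper, not part of the pipeline.

```python
import json
from collections import Counter


def count_variant_types(lines):
    """Tally the variant type of each evidence string (one JSON object per line)."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        evidence = json.loads(line)
        # Assumption: the variant object sits at the top level under "variant".
        variant = evidence.get("variant") or {}
        counts[variant.get("type", "No variant info")] += 1
    return counts


# Made-up input lines, for illustration only.
example = [
    '{"variant": {"id": "http://identifiers.org/dbsnp/rs281865416", "type": "snp single"}}',
    '{"variant": {"id": "http://identifiers.org/dbsnp/rs281865416", "type": "snp multiple"}}',
    '{}',
]
print(count_variant_types(example))
```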

It is no surprise that "snp snp interaction" is unused: GWAS Catalog is the only data source that has such interactions, but the script processing the SNPs ignored them, so we asked GWAS Catalog to filter them out. In the 19.06 GWAS SNP file there were 115 SNP interactions. Should we try to recover them?

iandunham commented 4 years ago

Indels are a separate type to snp_multiple or MNV. Do we have a variant type for indels? We should.

This is the first I have heard about MNVs, although obviously they are theoretically feasible. One would normally think of this as two haplotypes, each comprising 3 SNPs. The haplotype approach is flexible, as it can allow alternate configurations as well. However, if dbSNP allows this type of variant then maybe we should too. Since there are not many, we need to be careful that we don't end up supporting something that isn't useful. Maybe haplotype would be better. Ask one of the Genetics team.

tskir commented 4 years ago

To continue the discussion on this...

ClinVar variant counts

As @AsierGonzalez requested, the variant counts in ClinVar (2019/09 release) are:

MNVs

@iandunham I agree that MNVs are ideally thought of as haplotypes; however, data sources frequently don't provide them as such. I, too, don't think that we should have a separate term for MNVs, precisely because of the confusion it generates and the difficulty of differentiating between an MNV and an indel. But we definitely need some new variant types besides the existing "SNPs".

tskir commented 4 years ago

Ontology for variation types

As Andrew Nightingale suggested in the e-mail thread from which this issue has originated, there is a standard ontology for describing DNA variation types: the Variation Ontology. I believe using it is the best approach here. For example, we may adopt the following closed set of allowed values. We could replace the existing terms:

Existing term         Ontology term                                Ontology term label
snp single            http://purl.obolibrary.org/obo/VariO_0136    DNA substitution
snp snp interaction   http://purl.obolibrary.org/obo/VariO_0237    Genetic interaction
structural variant    http://purl.obolibrary.org/obo/VariO_0155    Variation affecting DNA structure

And add the following new terms to account for indels:

Ontology term                                Ontology term label
http://purl.obolibrary.org/obo/VariO_0141    DNA deletion
http://purl.obolibrary.org/obo/VariO_0142    DNA insertion
http://purl.obolibrary.org/obo/VariO_0143    DNA indel

I think this set of 6 variant types is quite succinct while providing a way to describe the most commonly occurring variants.
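
To make the proposal concrete, the replacement of existing terms could be sketched as a simple lookup. The term IRIs are the ones listed in the tables above; the names `EXISTING_TO_VARIO`, `NEW_INDEL_TERMS` and `to_ontology_term` are hypothetical and only serve this example.

```python
OBO = "http://purl.obolibrary.org/obo/"

# Proposed replacements for the three existing terms (from the first table above).
EXISTING_TO_VARIO = {
    "snp single": OBO + "VariO_0136",           # DNA substitution
    "snp snp interaction": OBO + "VariO_0237",  # Genetic interaction
    "structural variant": OBO + "VariO_0155",   # Variation affecting DNA structure
}

# Proposed additional terms for indels (from the second table above).
NEW_INDEL_TERMS = {
    OBO + "VariO_0141": "DNA deletion",
    OBO + "VariO_0142": "DNA insertion",
    OBO + "VariO_0143": "DNA indel",
}

# The closed set of allowed values would then be these six IRIs.
ALLOWED_TERMS = set(EXISTING_TO_VARIO.values()) | set(NEW_INDEL_TERMS)


def to_ontology_term(variant_type):
    """Map an existing variant type to its proposed ontology IRI, or None if unmapped."""
    return EXISTING_TO_VARIO.get(variant_type)


print(to_ontology_term("snp single"))    # http://purl.obolibrary.org/obo/VariO_0136
print(to_ontology_term("snp multiple"))  # None: no mapping is proposed for this value
```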

tskir commented 4 years ago

Alternatively, the more widely known SO (Sequence Ontology) can be used, although it lacks a term for genetic interaction or SNP-to-SNP interaction. (Still, we could ask for such a term to be added.)

AsierGonzalez commented 4 years ago

This will be reviewed when the data submission guidelines are designed (see #865)

AsierGonzalez commented 3 years ago

Alternatively, we may decide to drop this field altogether as it does not seem to be used anywhere

AsierGonzalez commented 3 years ago

This is the breakdown of variant types per data source in release 20.06 (used instead of 20.09 because UniProt evidence was still being processed in that release):

sourceID             type                 variant_count
eva                  snp single           115,779
eva                  structural variant   781
ot_genetics_portal   SNP                  359,778
ot_genetics_portal   insertion            16,181
ot_genetics_portal   deletion             14,155
phewas_catalog       snp single           182,694
uniprot              structural variant   31
uniprot              snp single           32,164
uniprot              snp multiple         7

AsierGonzalez commented 3 years ago

As seen in the table above there are four data sources that provide variant information: EVA, OT Genetics Portal, PheWAS catalog and UniProt. The problem is that the four use different approaches to calculate the variant type:

AsierGonzalez commented 3 years ago

The variant type has been excluded from the new JSON schema (more info in #1249) because this value is not used anywhere, the list is not complete enough and there is no unified method to calculate it for the different data sources. As a consequence, this field will be removed altogether from the evidence files, so there is no need to add any new values or map them to an ontology.