Closed AsierGonzalez closed 3 years ago
Indels are a separate type from snp_multiple or MNV. Do we have a variant type for indels? We should.
This is the first I have heard about MNVs, although obviously they are theoretically feasible. One would normally think of this as two haplotypes, each comprising 3 SNPs. The haplotype approach is flexible, as it can allow alternate configurations as well. However, if dbSNP allows this type of variant then maybe we should too. Since there are not many, we need to be careful that we don't end up supporting something that isn't useful. Maybe haplotype would be better. Ask one of the Genetics team.
To continue the discussion on this...
As @AsierGonzalez requested, the number of variants in ClinVar (2019/09 release) are:
@iandunham I agree that MNVs are ideally thought of as haplotypes; however, data sources frequently don't provide them as such. I, too, don't think that we should have a separate term for MNVs, precisely because of the confusion it generates, and because of the difficulty of differentiating between an MNV and an indel. But we definitely need some new variant types besides the existing "SNPs".
As Andrew Nightingale suggested in the e-mail thread from which this issue has originated, there is a standard ontology for describing DNA variation types: the Variation Ontology. I believe using it is the best approach here. For example, we may adopt the following closed set of allowed values. We could replace the existing terms:
Existing term | Ontology term | Ontology term label |
---|---|---|
snp single | http://purl.obolibrary.org/obo/VariO_0136 | DNA substitution |
snp snp interaction | http://purl.obolibrary.org/obo/VariO_0237 | Genetic interaction |
structural variant | http://purl.obolibrary.org/obo/VariO_0155 | Variation affecting DNA structure |
And add the following new terms to account for indels:
Ontology term | Ontology term label |
---|---|
http://purl.obolibrary.org/obo/VariO_0141 | DNA deletion |
http://purl.obolibrary.org/obo/VariO_0142 | DNA insertion |
http://purl.obolibrary.org/obo/VariO_0143 | DNA indel |
I think this set of 6 variant types is quite succinct while providing a way to describe the most commonly occurring variants.
Alternatively, the more widely known SO (Sequence Ontology) can be used, although it lacks a term for genetic interaction or SNP to SNP interaction. (Still, we could ask for such a term to be added.)
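As a rough illustration of the proposal above, the closed set of six values could be kept as a small lookup. This is only a sketch: the names `LEGACY_TO_VARIO`, `INDEL_TERMS`, `to_vario` and `is_allowed` are hypothetical and do not exist in any pipeline; the URIs are copied from the two tables.

```python
# Sketch only: all names here are hypothetical, the URIs come from the tables above.
VARIO_BASE = "http://purl.obolibrary.org/obo/"

# Existing free-text terms and the VariO terms proposed to replace them.
LEGACY_TO_VARIO = {
    "snp single": VARIO_BASE + "VariO_0136",           # DNA substitution
    "snp snp interaction": VARIO_BASE + "VariO_0237",  # Genetic interaction
    "structural variant": VARIO_BASE + "VariO_0155",   # Variation affecting DNA structure
}

# New terms proposed to account for indels.
INDEL_TERMS = {
    VARIO_BASE + "VariO_0141",  # DNA deletion
    VARIO_BASE + "VariO_0142",  # DNA insertion
    VARIO_BASE + "VariO_0143",  # DNA indel
}

# The closed set of allowed values: three replacements plus three indel terms.
ALLOWED_TERMS = set(LEGACY_TO_VARIO.values()) | INDEL_TERMS


def to_vario(legacy_type):
    """Return the proposed VariO URI for an existing term, or None if unmapped."""
    return LEGACY_TO_VARIO.get(legacy_type)


def is_allowed(term):
    """Check whether a URI belongs to the proposed closed set."""
    return term in ALLOWED_TERMS
```

With this layout a value such as "snp multiple" has no mapping, so `to_vario` returns `None` and the caller has to decide how to handle it explicitly.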
This will be reviewed when the data submission guidelines are designed (see #865)
Alternatively, we may decide to drop this field altogether as it does not seem to be used anywhere
This is the breakdown of variant types per data source in release 20.06 (used instead of 20.09 because UniProt evidence was still processed in that release):
sourceID | type | variant_count |
---|---|---|
eva | snp single | 115,779 |
eva | structural variant | 781 |
ot_genetics_portal | SNP | 359,778 |
ot_genetics_portal | insertion | 16,181 |
ot_genetics_portal | deletion | 14,155 |
phewas_catalog | snp single | 182,694 |
uniprot | structural variant | 31 |
uniprot | snp single | 32,164 |
uniprot | snp multiple | 7 |
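A breakdown like this can be obtained by tallying distinct variants per source and type. The sketch below is not the query actually used for the table; it assumes each evidence string is a parsed JSON object with a top-level `sourceID` and a `variant` object carrying `id` and `type`, and the function name is hypothetical.

```python
from collections import defaultdict


def count_variant_types(evidence_strings):
    """Count distinct variant ids per (sourceID, variant type) pair.

    Assumes each record looks like
    {"sourceID": ..., "variant": {"id": ..., "type": ...}};
    this layout is an assumption, not a confirmed schema.
    """
    seen = defaultdict(set)
    for record in evidence_strings:
        variant = record.get("variant") or {}
        key = (record.get("sourceID"), variant.get("type"))
        # A set de-duplicates variants reported by several evidence strings.
        seen[key].add(variant.get("id"))
    return {key: len(ids) for key, ids in seen.items()}
```

Counting distinct ids rather than records is a design choice here: one variant can back several evidence strings, and the table reports variants.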
As seen in the table above there are four data sources that provide variant information: EVA, OT Genetics Portal, PheWAS catalog and UniProt. The problem is that the four use different approaches to calculate the variant type:
- "SNP", "deletion" (\*), "insertion" (\*)

  (\*) NOTE: There is a bug: "deletion" and "insertion" should be defined the other way around.
- "snp single".
- "Variation Type" available in dbSNP.

The variant type has been excluded from the new JSON schema (more info in #1249) because this value is not used anywhere, the list is not complete enough and there is no unified method to calculate it for the different data sources. As a consequence, this field will be removed altogether from the evidence files, so there is no need to add any new fields or map them to an ontology.
In the 19.11 and 20.02 UniProt evidence file submissions there are seven invalid evidence strings because their variant type is "snp multiple", which is not one of the values accepted by the JSON schema (valid options are "snp single", "snp snp interaction" and "structural variant"):
According to Andrew Nightingale from UniProt, there was a discussion back in 2017 and someone in OT told them that they could use "snp multiple" for this type of variant. From what I have seen, such a change has never been added to the OT JSON schema, and until 19.11 they never used that type. Does anyone know anything about this?
Interestingly, the same variant in the example above is also reported by EVA but as a "snp single":
According to dbSNP, this variant is a Multiple Nucleotide Variation (MNV), a delins following HGVS nomenclature given that it encompasses three nucleotide changes: SE [TCGGAG] > RG [AGGGGG]. Therefore, it seems that annotating it as "snp single" is inaccurate.
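To make the distinction concrete, the sketch below counts the differing positions between the two sequences quoted above and assigns a coarse label from the allele strings alone. It assumes VCF-style alleles on the same strand, where an insertion or deletion keeps a shared leading base. The labels and function names are hypothetical and are not schema values.

```python
def count_mismatches(ref, alt):
    """Number of positions at which two equal-length sequences differ."""
    if len(ref) != len(alt):
        raise ValueError("sequences must have the same length")
    return sum(1 for r, a in zip(ref, alt) if r != a)


def classify_alleles(ref, alt):
    """Coarse variant label from reference and alternate allele strings.

    Assumes VCF-style alleles: an insertion has the reference as a prefix
    of the alternate, a deletion has the alternate as a prefix of the
    reference. Anything else of unequal length is labelled 'delins'.
    """
    if ref == alt:
        return "no change"
    if len(ref) == len(alt):
        # Equal length: one differing base is an SNV, several are an MNV.
        return "SNV" if count_mismatches(ref, alt) == 1 else "MNV"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "insertion"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"
    return "delins"


# The variant discussed above: three of the six bases differ.
print(count_mismatches("TCGGAG", "AGGGGG"))  # 3
print(classify_alleles("TCGGAG", "AGGGGG"))  # MNV
```

Deciding insertion versus deletion from which allele is longer also guards against the two labels being swapped.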
There are two related questions that need to be answered:
1. Do we need to add a new variant type to the schema to annotate small MNVs? This is simple to do, but what I don't know is whether the variant type is used somewhere by the pipeline or webapp. At first glance I couldn't find it shown on the genetic evidence table. I have asked both UniProt and EVA to give us an estimate of the number of MNVs they may have in their databases, which is likely much higher than the 7 UniProt evidence strings annotated with that type. For example, there are 2643 variants referred to as "indels" in the 19.11 ClinVar file, 1657 unique rsIDs. This will give us a better idea of the need to add a new variant type.
2. If so, should the new variant type be called something like "snp multiple", or MNV as dbSNP does?
For context, the current use of different variant types in the evidence strings is as follows:
It is no surprise that "snp snp interaction" is not being used: GWAS Catalog is the only data source that has them, but the script processing the SNPs ignored them, so we asked GWAS Catalog to filter them out. In the 19.06 GWAS SNP file there were 115 SNP interactions. Should we try to recover them?