Closed DSuveges closed 3 years ago
Re-assigning to 20.11 as UniProt evidence assessment is ongoing as noted in our release intentions document
The exploration of this overlap led to the removal of Uniprot as datasource in 20.09.
The Uniprot data source was only capturing ClinVar variants (at different pathogenicity filters). Since EVA provides this same information, we decided to stick with the latter to extract ClinVar evidence.
Now that we have a new EVA evidence file that includes ClinVar entries for variants of any clinical significance, we can do the comparison again. There are still 5k variants that only appear in UniProt:
In some cases, those variants are in ClinVar but they don't seem to have been submitted by OMIM and the condition may be missing (e.g. rs1000990130-RCV000731325.1). In other cases the variant is not in ClinVar (rs1004881058); the curator probably identified the variant after checking a paper mentioned in OMIM.
We know Uniprot submits some, but not all, of its evidence to ClinVar. We want to evaluate the true value of the non-submitted evidence to decide whether to rescue the Uniprot datasource.
Using the latest Uniprot and ClinVar submissions, we want to know how much of the Uniprot evidence does and doesn't overlap with ClinVar:
A simple check on variant rsIDs is enough to see that there are 20,994 variants in the uniprot datasource, of which only 16,899 (80%) are captured by any resource in the current evidence.
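The rsID check described above amounts to an anti-join plus a coverage ratio. Here is a minimal sketch in plain Python, assuming the rsIDs have already been collected into sets. The function name `coverage` and the toy rsIDs are made up for illustration and are not taken from the evidence files.

```python
def coverage(source_ids, reference_ids):
    """Return (captured, missing, fraction_captured) for source_ids vs reference_ids."""
    source_ids = set(source_ids)
    captured = source_ids & set(reference_ids)
    missing = source_ids - captured  # equivalent to a left anti-join on the ID
    fraction = len(captured) / len(source_ids) if source_ids else 0.0
    return captured, missing, fraction


# Toy rsIDs (placeholders, not real evidence)
uniprot_rsids = {"rs1", "rs2", "rs3", "rs4", "rs5"}
current_rsids = {"rs1", "rs2", "rs3", "rs4", "rs9"}

captured, missing, fraction = coverage(uniprot_rsids, current_rsids)
print(sorted(missing), f"{fraction:.0%}")

# The counts quoted above reduce to the same kind of ratio
print(f"{16_899 / 20_994:.1%}")
```

Comparing on rsID alone ignores the disease and target, so this only measures variant-level overlap.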
Some examples of the missing evidence are listed next:
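from pyspark.sql.functions import col, regexp_replace

# evdOld and evd are assumed to be already-loaded evidence DataFrames (older set with uniprot, and current set)
up = (
    evdOld
    .filter(col("sourceID") == "uniprot")
    .select(
        col("target.id").alias("targetId"),
        col("disease.id").alias("diseaseId"),
        col("disease.name").alias("diseaseFromSource"),
        col("unique_association_fields.dbSNPs").alias("variantRsId"),
        col("evidence.variant2disease.unique_experiment_reference").alias("literature"),
    )
    # Strip the Europe PMC URL prefix so only the PubMed ID remains
    .withColumn("literature", regexp_replace("literature", "http://europepmc.org/abstract/MED/", ""))
    .distinct()
    .persist()
)

# Uniprot rows whose rsID does not appear anywhere in the current evidence
up.join(evd.select("variantRsId").distinct(), on=["variantRsId"], how="left_anti").show(truncate=False)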
+------------+---------------+---------------+---------------------------------------------------------------------------------------------------------------------------+----------+
|variantRsId |targetId |diseaseId |diseaseFromSource |literature|
+------------+---------------+---------------+---------------------------------------------------------------------------------------------------------------------------+----------+
|rs1035791118|ENSG00000115705|Orphanet_95716 |Thyroid dyshormonogenesis 2A |12938097 |
|rs113173389 |ENSG00000102393|Orphanet_324 |Fabry disease |9100224 |
|rs1187685038|ENSG00000124615|Orphanet_308386|Molybdenum cofactor deficiency, complementation group A |9921896 |
|rs1187685038|ENSG00000124615|Orphanet_833 |Molybdenum cofactor deficiency, complementation group A |9921896 |
|rs1187685038|ENSG00000124615|Orphanet_99732 |Molybdenum cofactor deficiency, complementation group A |9921896 |
|rs1349176732|ENSG00000198626|Orphanet_3286 |Ventricular tachycardia, catecholaminergic polymorphic, 1, with or without atrial dysfunction and/or dilated cardiomyopathy|25372681 |
|rs1369490553|ENSG00000124479|Orphanet_649 |Norrie disease |8589700 |
|rs144965179 |ENSG00000118271|EFO_0004129 |Amyloidosis, transthyretin-related |17577687 |
|rs144965179 |ENSG00000118271|Orphanet_85447 |Amyloidosis, transthyretin-related |17577687 |
|rs144965179 |ENSG00000118271|Orphanet_85451 |Amyloidosis, transthyretin-related |17577687 |
|rs144965179 |ENSG00000118271|Orphanet_271861|Amyloidosis, transthyretin-related |17577687 |
|rs1474900361|ENSG00000181004|Orphanet_110 |Bardet-Biedl syndrome 12 |21344540 |
|rs1474900361|ENSG00000181004|EFO_0009023 |Bardet-Biedl syndrome 12 |21344540 |
|rs193922864 |ENSG00000196218|Orphanet_99741 |Malignant hyperthermia 1 |16163667 |
|rs193922864 |ENSG00000196218|Orphanet_423 |Malignant hyperthermia 1 |16163667 |
|rs193922864 |ENSG00000196218|EFO_0009071 |Malignant hyperthermia 1 |16163667 |
|rs200163795 |ENSG00000136931|EFO_0004266 |Premature ovarian failure 7 |19246354 |
|rs200163795 |ENSG00000136931|Orphanet_399805|Spermatogenic failure 8 |19246354 |
|rs267606727 |ENSG00000169105|Orphanet_2953 |Ehlers-Danlos syndrome, musculocontractural type 1 |20004762 |
|rs281865247 |ENSG00000167995|Orphanet_1243 |Macular dystrophy, vitelliform, 2 |14517959 |
+------------+---------------+---------------+---------------------------------------------------------------------------------------------------------------------------+----------+
only showing top 20 rows
However, looking some of them up in the platform shows that these relationships are already well captured.
If we look at associations, Uniprot contains 4,877 target-disease pairs:
Here are some examples of the 1,547 associations present in Uniprot but absent in ClinVar (e.g. MONDO_0008199 - ENSG00000138246):
+---------------+-------------+-----------------+-----------+----------+
| targetId| diseaseId|diseaseFromSource|variantRsId|literature|
+---------------+-------------+-----------------+-----------+----------+
|ENSG00000138246|MONDO_0008199|Parkinson disease|rs145242123| 25393719|
|ENSG00000138246|MONDO_0008199|Parkinson disease|rs146930051| 25393719|
|ENSG00000138246|MONDO_0008199|Parkinson disease|rs387907571| 24218364|
|ENSG00000138246|MONDO_0008199|Parkinson disease|rs766013346| 25393719|
+---------------+-------------+-----------------+-----------+----------+
They can also be found in Clinvar but mapped to a slightly different term:
+---------------+-----------+-----------------------------+-----------+------------------------------+
|targetId |diseaseId |diseaseFromSource |variantRsId|literature |
+---------------+-----------+-----------------------------+-----------+------------------------------+
|ENSG00000138246|EFO_0002508|parkinson disease, late-onset|rs387907571|[20301402, 20482602, 24218364]|
+---------------+-----------+-----------------------------+-----------+------------------------------+
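The Parkinson example suggests a follow-up check: for each association missing from ClinVar, see whether its variants are present for the same target under a different disease ID. The sketch below does this in plain Python using only the rows shown in the two tables above. The function name `remapped_variants` and the tuple layout are assumptions for illustration, and the ClinVar list contains just the single row displayed, not the full ClinVar evidence.

```python
from collections import defaultdict

# (targetId, diseaseId, variantRsId) rows copied from the tables above
uniprot_rows = [
    ("ENSG00000138246", "MONDO_0008199", "rs145242123"),
    ("ENSG00000138246", "MONDO_0008199", "rs146930051"),
    ("ENSG00000138246", "MONDO_0008199", "rs387907571"),
    ("ENSG00000138246", "MONDO_0008199", "rs766013346"),
]
clinvar_rows = [
    ("ENSG00000138246", "EFO_0002508", "rs387907571"),
]


def remapped_variants(source_rows, reference_rows):
    """Find source variants whose (target, disease) pair is missing from the
    reference but which appear there, for the same target, under another disease ID."""
    reference_pairs = {(target, disease) for target, disease, _ in reference_rows}
    diseases_by_variant = defaultdict(set)
    for target, disease, rsid in reference_rows:
        diseases_by_variant[(target, rsid)].add(disease)

    remapped = {}
    for target, disease, rsid in source_rows:
        if (target, disease) in reference_pairs:
            continue  # association already captured, nothing to report
        others = diseases_by_variant.get((target, rsid), set()) - {disease}
        if others:
            remapped[(target, disease, rsid)] = sorted(others)
    return remapped


result = remapped_variants(uniprot_rows, clinvar_rows)
print(result)
```

Variants reported here are candidates for duplicated evidence under a neighbouring ontology term rather than truly missing evidence.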
After discussion with @ireneisdoomed and @DSuveges, we agreed that the value of adding the ~20% of variants not otherwise captured exceeds the cost of adding duplicates. For this reason, we will be rescuing the uniprot datasource, aiming at 20.04 (new evidence) as the first release affected.
The first round shows that >8000 (out of 21k) UniProt variants appear only in UniProt evidence. They seem to be in ClinVar but are filtered out of the EVA evidence by the clinical significance threshold.