opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Measure clash/redundancy between EVA/ClinVar and Uniprot #1141

Closed DSuveges closed 3 years ago

DSuveges commented 4 years ago

A first pass shows that more than 8,000 of the ~21k UniProt variants appear only in UniProt evidence. They seem to be present in ClinVar but are filtered out of the EVA evidence by the clinical significance threshold.

andrewhercules commented 4 years ago

Re-assigning to 20.11: the UniProt evidence assessment is still ongoing, as noted in our release intentions document.

d0choa commented 4 years ago

The exploration of this overlap led to the removal of UniProt as a data source in 20.09.

The UniProt data source was only capturing ClinVar variants (at different pathogenicity filters). Since EVA already provides this same ClinVar evidence, we decided to stick with EVA.

AsierGonzalez commented 3 years ago

Now that we have a new EVA evidence file that includes ClinVar entries at every level of clinical significance, we can repeat the comparison. There are still about 5k variants that appear only in UniProt.

In some cases, those variants are in ClinVar but do not seem to have been submitted by OMIM, and the condition may be missing (e.g. rs1000990130-RCV000731325.1). In other cases the variant is not in ClinVar at all (rs1004881058); the curator probably identified the variant after checking a paper mentioned in OMIM.

d0choa commented 3 years ago

We know UniProt submits some, but not all, of its evidence to ClinVar. We want to evaluate the true value of the non-submitted evidence to decide whether to rescue the UniProt data source.

Using the latest UniProt and ClinVar submissions, we want to know how much UniProt evidence does and does not overlap with ClinVar.

d0choa commented 3 years ago

A simple check on variant rsIDs is enough to see that, of the 20,994 variants in the UniProt data source, only 16,899 (80%) are captured by any resource in the current evidence.
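
As a quick arithmetic check of those figures, the sketch below recomputes the captured and missing shares from the two quoted counts. It is plain Python rather than Spark, and the helper `coverage` and the toy rsIDs passed to it are hypothetical illustrations; only the two counts come from this thread.

```python
# Counts quoted in the comment above.
total_uniprot_variants = 20_994
captured_in_current_evidence = 16_899

missing = total_uniprot_variants - captured_in_current_evidence
captured_pct = 100 * captured_in_current_evidence / total_uniprot_variants
missing_pct = 100 * missing / total_uniprot_variants

print(f"captured: {captured_in_current_evidence} ({captured_pct:.1f}%)")
print(f"missing:  {missing} ({missing_pct:.1f}%)")


def coverage(source_rsids, reference_rsids):
    """Split source rsIDs into those found in the reference and those missing.

    Hypothetical helper: the set-based equivalent of an inner join and a
    left-anti join on distinct rsIDs.
    """
    source = set(source_rsids)
    reference = set(reference_rsids)
    return source & reference, source - reference


# Toy rsIDs, made up for illustration only.
found, absent = coverage(["rs1", "rs2", "rs3"], ["rs2", "rs3", "rs4"])
```

With the quoted counts this gives 4,095 missing variants, roughly 19.5% of the UniProt set.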

Some examples of the missing evidence are listed next:

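from pyspark.sql.functions import col, regexp_replace

# evdOld (old evidence) and evd (current evidence) are DataFrames loaded earlier.
# UniProt evidence: one row per target / disease / variant / reference.
up = (
    evdOld
    .filter(col("sourceID") == "uniprot")
    .select(
        col("target.id").alias("targetId"),
        col("disease.id").alias("diseaseId"),
        col("disease.name").alias("diseaseFromSource"),
        col("unique_association_fields.dbSNPs").alias("variantRsId"),
        col("evidence.variant2disease.unique_experiment_reference").alias("literature"),
    )
    # Strip the Europe PMC URL prefix so only the PubMed ID remains.
    .withColumn("literature", regexp_replace("literature", "http://europepmc.org/abstract/MED/", ""))
    .distinct()
    .persist()
)

# UniProt rows whose rsID does not appear anywhere in the current evidence.
up.join(evd.select("variantRsId").distinct(), on=["variantRsId"], how="left_anti").show(truncate=False)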
+------------+---------------+---------------+---------------------------------------------------------------------------------------------------------------------------+----------+
|variantRsId |targetId       |diseaseId      |diseaseFromSource                                                                                                          |literature|
+------------+---------------+---------------+---------------------------------------------------------------------------------------------------------------------------+----------+
|rs1035791118|ENSG00000115705|Orphanet_95716 |Thyroid dyshormonogenesis 2A                                                                                               |12938097  |
|rs113173389 |ENSG00000102393|Orphanet_324   |Fabry disease                                                                                                              |9100224   |
|rs1187685038|ENSG00000124615|Orphanet_308386|Molybdenum cofactor deficiency, complementation group A                                                                    |9921896   |
|rs1187685038|ENSG00000124615|Orphanet_833   |Molybdenum cofactor deficiency, complementation group A                                                                    |9921896   |
|rs1187685038|ENSG00000124615|Orphanet_99732 |Molybdenum cofactor deficiency, complementation group A                                                                    |9921896   |
|rs1349176732|ENSG00000198626|Orphanet_3286  |Ventricular tachycardia, catecholaminergic polymorphic, 1, with or without atrial dysfunction and/or dilated cardiomyopathy|25372681  |
|rs1369490553|ENSG00000124479|Orphanet_649   |Norrie disease                                                                                                             |8589700   |
|rs144965179 |ENSG00000118271|EFO_0004129    |Amyloidosis, transthyretin-related                                                                                         |17577687  |
|rs144965179 |ENSG00000118271|Orphanet_85447 |Amyloidosis, transthyretin-related                                                                                         |17577687  |
|rs144965179 |ENSG00000118271|Orphanet_85451 |Amyloidosis, transthyretin-related                                                                                         |17577687  |
|rs144965179 |ENSG00000118271|Orphanet_271861|Amyloidosis, transthyretin-related                                                                                         |17577687  |
|rs1474900361|ENSG00000181004|Orphanet_110   |Bardet-Biedl syndrome 12                                                                                                   |21344540  |
|rs1474900361|ENSG00000181004|EFO_0009023    |Bardet-Biedl syndrome 12                                                                                                   |21344540  |
|rs193922864 |ENSG00000196218|Orphanet_99741 |Malignant hyperthermia 1                                                                                                   |16163667  |
|rs193922864 |ENSG00000196218|Orphanet_423   |Malignant hyperthermia 1                                                                                                   |16163667  |
|rs193922864 |ENSG00000196218|EFO_0009071    |Malignant hyperthermia 1                                                                                                   |16163667  |
|rs200163795 |ENSG00000136931|EFO_0004266    |Premature ovarian failure 7                                                                                                |19246354  |
|rs200163795 |ENSG00000136931|Orphanet_399805|Spermatogenic failure 8                                                                                                    |19246354  |
|rs267606727 |ENSG00000169105|Orphanet_2953  |Ehlers-Danlos syndrome, musculocontractural type 1                                                                         |20004762  |
|rs281865247 |ENSG00000167995|Orphanet_1243  |Macular dystrophy, vitelliform, 2                                                                                          |14517959  |
+------------+---------------+---------------+---------------------------------------------------------------------------------------------------------------------------+----------+
only showing top 20 rows
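
Note that the rows above are not one per variant: the same rsID is repeated once per mapped disease term. A minimal sketch, in plain Python and using only the rsIDs visible in the 20 rows printed above, counts how many distinct variants those rows cover.

```python
from collections import Counter

# rsIDs copied from the 20 rows shown above, in order of appearance.
shown_rsids = (
    ["rs1035791118", "rs113173389"]
    + ["rs1187685038"] * 3
    + ["rs1349176732", "rs1369490553"]
    + ["rs144965179"] * 4
    + ["rs1474900361"] * 2
    + ["rs193922864"] * 3
    + ["rs200163795"] * 2
    + ["rs267606727", "rs281865247"]
)

# Number of printed rows per rsID.
rows_per_variant = Counter(shown_rsids)

print(len(shown_rsids), "rows,", len(rows_per_variant), "distinct rsIDs")
```

This is why variant-level counts should be taken on distinct rsIDs rather than on rows of `up`.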

However, looking some of these up in the Platform shows that the target-disease relationships are already well captured.

At the association level, UniProt contains 4,877 target-disease pairs.

Looking at some examples of the 1,547 associations present in UniProt but absent from ClinVar (e.g. MONDO_0008199 - ENSG00000138246):

+---------------+-------------+-----------------+-----------+----------+
|       targetId|    diseaseId|diseaseFromSource|variantRsId|literature|
+---------------+-------------+-----------------+-----------+----------+
|ENSG00000138246|MONDO_0008199|Parkinson disease|rs145242123|  25393719|
|ENSG00000138246|MONDO_0008199|Parkinson disease|rs146930051|  25393719|
|ENSG00000138246|MONDO_0008199|Parkinson disease|rs387907571|  24218364|
|ENSG00000138246|MONDO_0008199|Parkinson disease|rs766013346|  25393719|
+---------------+-------------+-----------------+-----------+----------+

They can also be found in ClinVar, but mapped to a slightly different term:

+---------------+-----------+-----------------------------+-----------+------------------------------+
|targetId       |diseaseId  |diseaseFromSource            |variantRsId|literature                    |
+---------------+-----------+-----------------------------+-----------+------------------------------+
|ENSG00000138246|EFO_0002508|parkinson disease, late-onset|rs387907571|[20301402, 20482602, 24218364]|
+---------------+-----------+-----------------------------+-----------+------------------------------+
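
The effect of the different disease mapping can be sketched in plain Python with the rows from the two tables above. A comparison on (target, disease) pairs reports the association as UniProt-only, while a comparison on (target, variant) finds the same variant in ClinVar under another disease ID. The variable and function names are hypothetical, and `clinvar_rows` holds only the single ClinVar row shown in this thread, so nothing is implied about the other three variants.

```python
# (targetId, diseaseId, variantRsId) rows copied from the tables above.
uniprot_rows = [
    ("ENSG00000138246", "MONDO_0008199", "rs145242123"),
    ("ENSG00000138246", "MONDO_0008199", "rs146930051"),
    ("ENSG00000138246", "MONDO_0008199", "rs387907571"),
    ("ENSG00000138246", "MONDO_0008199", "rs766013346"),
]
clinvar_rows = [
    ("ENSG00000138246", "EFO_0002508", "rs387907571"),
]


def pairs(rows):
    """Distinct (target, disease) pairs in a list of evidence rows."""
    return {(target, disease) for target, disease, _ in rows}


# Association-level comparison: pairs present in UniProt but not in ClinVar.
uniprot_only_pairs = pairs(uniprot_rows) - pairs(clinvar_rows)

# Variant-level lookup: disease IDs that ClinVar attaches to each (target, rsID).
clinvar_by_variant = {}
for target, disease, rsid in clinvar_rows:
    clinvar_by_variant.setdefault((target, rsid), set()).add(disease)

# UniProt rows whose variant is in ClinVar, but under a different disease ID.
remapped = [
    (target, disease, rsid, sorted(clinvar_by_variant[(target, rsid)]))
    for target, disease, rsid in uniprot_rows
    if (target, rsid) in clinvar_by_variant
    and disease not in clinvar_by_variant[(target, rsid)]
]

print(uniprot_only_pairs)
print(remapped)
```

Pairs flagged as UniProt-only at the association level may therefore include evidence that ClinVar already carries under a neighbouring ontology term.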

d0choa commented 3 years ago

After discussion with @ireneisdoomed and @DSuveges, we agreed that the value of adding the ~20% of variants not currently captured outweighs the cost of adding duplicates. For this reason, we will rescue the UniProt data source, aiming at 20.04 (new evidence) as the first release affected.