opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Improve disease/target evidence ingestion from uniprot #3459

Open DSuveges opened 5 days ago

DSuveges commented 5 days ago

UniProt provides a curated set of naturally occurring, protein-coding variants that are involved in diseases. This data is already captured in the uniprot_variants dataset, which is produced by a parser developed by the UniProt team. However, recent developments in the platform and the upcoming integration of the genetics product have made it necessary to reconsider the evidence generation process and the data model.

Consideration

TODOs

DSuveges commented 5 days ago

SPARQL data retrieval

RsID to variant ID mapping

Disease mapping

The same disease mapping pipeline that we use for other parsers is applied:

from common.ontology import add_efo_mapping

mapped_df = add_efo_mapping(unmapped_df, spark, '.')

DSuveges commented 5 days ago

Comparison with existing evidence set

These comparisons expect valid EFO mappings:

Conclusions:

Let's look at P05067 vs OMIM:605714. The UniProt page lists four rsIDs as evidence for this association. Assuming perfect mapping, that would mean one association and four pieces of evidence. This is exactly what we see in the new pipeline:

+------------------+----------------------------------------+-------------------+-----------+-------------------------+----------------------------------------+
|targetFromSourceId|diseaseFromSource                       |diseaseFromSourceId|variantRsId|diseaseFromSourceMappedId|name                                    |
+------------------+----------------------------------------+-------------------+-----------+-------------------------+----------------------------------------+
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |rs63750579 |MONDO_0011583            |cerebral amyloid angiopathy, APP-related|
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |rs63750921 |MONDO_0011583            |cerebral amyloid angiopathy, APP-related|
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |rs63749810 |MONDO_0011583            |cerebral amyloid angiopathy, APP-related|
+------------------+----------------------------------------+-------------------+-----------+-------------------------+----------------------------------------+

However, the old pipeline produces 32 pieces of evidence for the same association: it maps the disease to 8 EFO terms, so each of the four rsIDs is repeated 8 times (4 x 8 = 32):

+------------------+----------------------------------------+-------------------+-------------------------+--------------------------------------------------------------+
|targetFromSourceId|diseaseFromSource                       |diseaseFromSourceId|diseaseFromSourceMappedId|name                                                          |
+------------------+----------------------------------------+-------------------+-------------------------+--------------------------------------------------------------+
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_324713          |Hereditary cerebral hemorrhage with amyloidosis, Italian type |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_100006          |Hereditary cerebral hemorrhage with amyloidosis, Dutch type   |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_324718          |Hereditary cerebral hemorrhage with amyloidosis, Flemish type |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |MONDO_0011583            |cerebral amyloid angiopathy, APP-related                      |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_324708          |Hereditary cerebral hemorrhage with amyloidosis, Iowa type    |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_324703          |Hereditary cerebral hemorrhage with amyloidosis, Piedmont type|
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_324723          |Hereditary cerebral hemorrhage with amyloidosis, Arctic type  |
|P05067            |Cerebral amyloid angiopathy, APP-related|OMIM:605714        |Orphanet_85458           |Hereditary cerebral hemorrhage with amyloidosis               |
+------------------+----------------------------------------+-------------------+-------------------------+--------------------------------------------------------------+
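
The arithmetic behind the explosion can be reproduced with a small, self-contained sketch. It is not the pipeline code, only an illustration in plain Python of what a one-to-many disease mapping does to the evidence count. The helper `evidence_rows` is a hypothetical name, and since only three rsIDs are listed in the table above, the fourth one is represented by a placeholder string.

```python
from itertools import product

# Three rsIDs are copied from the new-pipeline table above. The fourth rsID is
# not listed there, so a placeholder stands in for it (assumption).
rs_ids = ["rs63750579", "rs63750921", "rs63749810", "<fourth rsID, not listed>"]

# Disease IDs the old pipeline mapped OMIM:605714 to (from the table above).
old_mappings = [
    "Orphanet_324713", "Orphanet_100006", "Orphanet_324718", "MONDO_0011583",
    "Orphanet_324708", "Orphanet_324703", "Orphanet_324723", "Orphanet_85458",
]
# The new pipeline maps the same disease to a single ID.
new_mappings = ["MONDO_0011583"]


def evidence_rows(rs_ids, mapped_ids):
    """Pair every variant with every mapped disease ID (a one-to-many join)."""
    return [
        {"variantRsId": rs, "diseaseFromSourceMappedId": mapped}
        for rs, mapped in product(rs_ids, mapped_ids)
    ]


old_rows = evidence_rows(rs_ids, old_mappings)
new_rows = evidence_rows(rs_ids, new_mappings)
print(len(old_rows), len(new_rows))  # 32 4
```

With a single mapped ID per disease, the number of evidence rows equals the number of variants, which is what the UniProt page suggests it should be.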

DSuveges commented 5 days ago

Comparing mappings with the previous pipeline

Looking at disease/target pairs in the source, only 38 pairs were not mapped to EFO by the new pipeline. Some of the mappings seem relevant; however, a number of them are not found in the EFO slim on which our disease index is based (name is null):

+-------------------------+-------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|diseaseFromSourceMappedId|diseaseFromSourceId|diseaseFromSource                                            |name                                                                   |
+-------------------------+-------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|MONDO_0011875            |OMIM:607628        |Epilepsy, idiopathic generalized 11                          |null                                                                   |
|MONDO_0008490            |OMIM:184840        |Otospondylomegaepiphyseal dysplasia, autosomal dominant      |otospondylomegaepiphyseal dysplasia, autosomal dominant                |
|Orphanet_166100          |OMIM:184840        |Otospondylomegaepiphyseal dysplasia, autosomal dominant      |Stickler syndrome type 3                                               |
|EFO_0009080              |OMIM:308100        |Ichthyosis, X-linked                                         |x-linked ichthyosis with steryl-sulfatase deficiency                   |
|MONDO_0010622            |OMIM:308100        |Ichthyosis, X-linked                                         |recessive X-linked ichthyosis                                          |
|MONDO_0013568            |OMIM:614090        |Sick sinus syndrome 3                                        |null                                                                   |
|MONDO_0012161            |OMIM:608957        |Immunodeficiency 116                                         |null                                                                   |
|MONDO_0011650            |OMIM:606217        |Atrioventricular septal defect 2                             |null                                                                   |
|MONDO_0011652            |OMIM:606232        |Phelan-McDermid syndrome                                     |Phelan-McDermid syndrome                                               |
|Orphanet_48652           |OMIM:606232        |Phelan-McDermid syndrome                                     |Monosomy 22q13                                                         |
|MONDO_0859376            |OMIM:620241        |Hydrocephalus, congenital, 5                                 |null                                                                   |
|MONDO_0013957            |OMIM:614893        |Immunodeficiency 32A                                         |null                                                                   |
|MONDO_0011875            |OMIM:607628        |Juvenile absence epilepsy 2                                  |null                                                                   |
|MONDO_0859316            |OMIM:620121        |Iron overload                                                |null                                                                   |
|MONDO_0012843            |OMIM:612269        |Epilepsy, childhood absence 5                                |null                                                                   |
|MONDO_0008633            |OMIM:191900        |Muckle-Wells syndrome                                        |Muckle-Wells syndrome                                                  |
|MONDO_0044315            |OMIM:617439        |Craniosynostosis 7                                           |null                                                                   |
|MONDO_0010389            |OMIM:300645        |Immunodeficiency 34                                          |null                                                                   |
|MONDO_0011776            |OMIM:607115        |Chronic infantile neurologic cutaneous and articular syndrome|CINCA syndrome                                                         |
|MONDO_0008693            |OMIM:200110        |Ablepharon-macrostomia syndrome                              |ablepharon macrostomia syndrome                                        |
|MONDO_0012670            |OMIM:611451        |Deafness, autosomal recessive, 63                            |autosomal recessive nonsyndromic hearing loss 63                       |
|MONDO_0009288            |OMIM:232240        |Glycogen storage disease 1C                                  |glycogen storage disease Ib                                            |
|Orphanet_79259           |OMIM:232240        |Glycogen storage disease 1C                                  |Glycogen storage disease due to glucose-6-phosphatase deficiency type b|
|Orphanet_364             |OMIM:232240        |Glycogen storage disease 1C                                  |Glycogen storage disease due to glucose-6-phosphatase deficiency       |
|MONDO_0011163            |OMIM:601887        |Malignant hyperthermia 5                                     |null                                                                   |
|MONDO_0009288            |OMIM:232220        |Glycogen storage disease 1B                                  |glycogen storage disease Ib                                            |
|Orphanet_364             |OMIM:232220        |Glycogen storage disease 1B                                  |Glycogen storage disease due to glucose-6-phosphatase deficiency       |
|Orphanet_79259           |OMIM:232220        |Glycogen storage disease 1B                                  |Glycogen storage disease due to glucose-6-phosphatase deficiency type b|
|MONDO_0011875            |OMIM:607628        |Juvenile myoclonic epilepsy 8                                |null                                                                   |
|MONDO_0008856            |OMIM:209950        |Immunodeficiency 27A                                         |null                                                                   |
|MONDO_0008853            |OMIM:209885        |Barber-Say syndrome                                          |Barber-Say syndrome                                                    |
|MONDO_0013955            |OMIM:614891        |Immunodeficiency 30                                          |null                                                                   |
|MONDO_0010576            |OMIM:304400        |Deafness, X-linked, 2                                        |X-linked mixed hearing loss with perilymphatic gusher                  |
|Orphanet_383             |OMIM:304400        |Deafness, X-linked, 2                                        |X-linked mixed deafness with perilymphatic gusher                      |
|MONDO_0013956            |OMIM:614892        |Immunodeficiency 31A                                         |null                                                                   |
|MONDO_0030334            |OMIM:619441        |Encephalitis, acute, infection (viral)-induced, 11           |null                                                                   |
|EFO_0004190              |OMIM:609887        |Glaucoma 1, open angle, G                                    |open-angle glaucoma                                                    |
|MONDO_0012141            |OMIM:608864        |Non-syndromic orofacial cleft 6                              |null                                                                   |
|MONDO_0030004            |OMIM:618830        |Autism 20                                                    |null                                                                   |
|MONDO_0013498            |OMIM:613950        |Schizophrenia 15                                             |null                                                                   |
|MONDO_0011159            |OMIM:601868        |Deafness, autosomal dominant, 13                             |null                                                                   |
|MONDO_0009335            |OMIM:235400        |Hemolytic uremic syndrome, atypical, 1                       |null                                                                   |
|MONDO_0007349            |OMIM:120100        |Familial cold autoinflammatory syndrome 1                    |familial cold autoinflammatory syndrome 1                              |
|Orphanet_47045           |OMIM:120100        |Familial cold autoinflammatory syndrome 1                    |Familial cold urticaria                                                |
|MONDO_0014710            |OMIM:616622        |Immunodeficiency 42                                          |null                                                                   |
|MONDO_0044206            |OMIM:215150        |Otospondylomegaepiphyseal dysplasia, autosomal recessive     |otospondylomegaepiphyseal dysplasia, autosomal recessive               |
|MONDO_0007849            |OMIM:148200        |Keratoendothelitis fugax hereditaria                         |keratitis fugax hereditaria                                            |
+-------------------------+-------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
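
The "name is null" rows come from looking up each mapped ID in the disease index. A minimal sketch of that check in plain Python follows, assuming the index can be treated as an ID-to-name dictionary. The names `disease_index`, `mappings` and `annotate` are hypothetical, and the data is a small subset of the table above rather than the full index.

```python
# Subset of the EFO slim disease index: ID -> name (values from the table above).
disease_index = {
    "MONDO_0008490": "otospondylomegaepiphyseal dysplasia, autosomal dominant",
    "Orphanet_166100": "Stickler syndrome type 3",
    "EFO_0004190": "open-angle glaucoma",
}

# (diseaseFromSourceId, diseaseFromSourceMappedId) pairs from the table above.
mappings = [
    ("OMIM:607628", "MONDO_0011875"),
    ("OMIM:184840", "MONDO_0008490"),
    ("OMIM:184840", "Orphanet_166100"),
    ("OMIM:614090", "MONDO_0013568"),
    ("OMIM:609887", "EFO_0004190"),
]


def annotate(mappings, index):
    """Left-join the mappings with the index; name is None when the ID is absent."""
    return [
        {
            "diseaseFromSourceId": source_id,
            "diseaseFromSourceMappedId": mapped_id,
            "name": index.get(mapped_id),
        }
        for source_id, mapped_id in mappings
    ]


annotated = annotate(mappings, disease_index)
# Mapped IDs that are not part of the disease index (the "null" rows).
missing = [row["diseaseFromSourceMappedId"] for row in annotated if row["name"] is None]
print(missing)  # ['MONDO_0011875', 'MONDO_0013568']
```

Rows with a missing name point to terms that would need to be added to the EFO slim before the mapping could be used in the platform.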

Altogether, I'm happy with the performance compared to the existing pipeline.