Open DSuveges opened 5 days ago
vcf_string=1
), which is required to normalise indel variants. The usual disease mapping pipeline is applied, as for the other parsers:
from common.ontology import add_efo_mapping
mapped_df = add_efo_mapping(unmapped_df, spark, '.')
These comparisons expect valid EFO mappings:
Let's look at P05067 vs OMIM:605714. There are four rsIDs listed as evidence for this association on the UniProt page, so with perfect mapping we expect one association and four pieces of evidence. This is exactly what we see in the new pipeline:
+------------------+----------------------------------------+-------------------+-----------+-------------------------+----------------------------------------+
|targetFromSourceId|diseaseFromSource |diseaseFromSourceId|variantRsId|diseaseFromSourceMappedId|name |
+------------------+----------------------------------------+-------------------+-----------+-------------------------+----------------------------------------+
|P05067 |Cerebral amyloid angiopathy, APP-related|OMIM:605714 |rs63750579 |MONDO_0011583 |cerebral amyloid angiopathy, APP-related|
|P05067 |Cerebral amyloid angiopathy, APP-related|OMIM:605714 |rs63750921 |MONDO_0011583 |cerebral amyloid angiopathy, APP-related|
|P05067 |Cerebral amyloid angiopathy, APP-related|OMIM:605714 |rs63749810 |MONDO_0011583 |cerebral amyloid angiopathy, APP-related|
+------------------+----------------------------------------+-------------------+-----------+-------------------------+----------------------------------------+
However, the old pipeline produces 32 evidence for the same pair: an 8x explosion, because that pipeline maps the disease to 8 EFO terms (4 variants x 8 mappings = 32):
+------------------+----------------------------------------+-------------------+-------------------------+--------------------------------------------------------------+
|targetFromSourceId|diseaseFromSource |diseaseFromSourceId|diseaseFromSourceMappedId|name |
+------------------+----------------------------------------+-------------------+-------------------------+--------------------------------------------------------------+
|P05067 |Cerebral amyloid angiopathy, APP-related|OMIM:605714 |Orphanet_324713 |Hereditary cerebral hemorrhage with amyloidosis, Italian type |
|P05067 |Cerebral amyloid angiopathy, APP-related|OMIM:605714 |Orphanet_100006 |Hereditary cerebral hemorrhage with amyloidosis, Dutch type |
|P05067 |Cerebral amyloid angiopathy, APP-related|OMIM:605714 |Orphanet_324718 |Hereditary cerebral hemorrhage with amyloidosis, Flemish type |
|P05067 |Cerebral amyloid angiopathy, APP-related|OMIM:605714 |MONDO_0011583 |cerebral amyloid angiopathy, APP-related |
|P05067 |Cerebral amyloid angiopathy, APP-related|OMIM:605714 |Orphanet_324708 |Hereditary cerebral hemorrhage with amyloidosis, Iowa type |
|P05067 |Cerebral amyloid angiopathy, APP-related|OMIM:605714 |Orphanet_324703 |Hereditary cerebral hemorrhage with amyloidosis, Piedmont type|
|P05067 |Cerebral amyloid angiopathy, APP-related|OMIM:605714 |Orphanet_324723 |Hereditary cerebral hemorrhage with amyloidosis, Arctic type |
|P05067 |Cerebral amyloid angiopathy, APP-related|OMIM:605714 |Orphanet_85458 |Hereditary cerebral hemorrhage with amyloidosis |
+------------------+----------------------------------------+-------------------+-------------------------+--------------------------------------------------------------+
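The row counts follow from simple join arithmetic. Below is a minimal sketch in plain Python (not the Spark code of either pipeline), assuming four evidence rows for the pair. The `variant_N` labels are placeholders, while the eight mapped IDs are copied from the table above.

```python
from itertools import product

# Placeholder labels for the four variants reported for P05067 / OMIM:605714.
variants = [f"variant_{i}" for i in range(1, 5)]

# Disease IDs the old pipeline assigns to OMIM:605714 (from the table above).
old_mappings = [
    "Orphanet_324713", "Orphanet_100006", "Orphanet_324718", "MONDO_0011583",
    "Orphanet_324708", "Orphanet_324703", "Orphanet_324723", "Orphanet_85458",
]
# The new pipeline keeps a single mapping.
new_mappings = ["MONDO_0011583"]


def explode(variants, mappings):
    """Emit one evidence row per (variant, mapped disease) combination."""
    return [
        {"variantId": v, "diseaseFromSourceMappedId": m}
        for v, m in product(variants, mappings)
    ]


old_evidence = explode(variants, old_mappings)
new_evidence = explode(variants, new_mappings)
print(len(old_evidence), len(new_evidence))  # 32 4
```

Every additional mapping for the same source disease multiplies the evidence count, which is why a one-to-one mapping keeps the evidence count equal to the number of variants.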
When looking at disease/target pairs in the source, there are only 38 pairs that were not mapped to EFO by the new pipeline. Some of the mappings seem relevant; however, a number of them are not found in the EFO slim, which our disease index is based on (name is null):
+-------------------------+-------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|diseaseFromSourceMappedId|diseaseFromSourceId|diseaseFromSource |name |
+-------------------------+-------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
|MONDO_0011875 |OMIM:607628 |Epilepsy, idiopathic generalized 11 |null |
|MONDO_0008490 |OMIM:184840 |Otospondylomegaepiphyseal dysplasia, autosomal dominant |otospondylomegaepiphyseal dysplasia, autosomal dominant |
|Orphanet_166100 |OMIM:184840 |Otospondylomegaepiphyseal dysplasia, autosomal dominant |Stickler syndrome type 3 |
|EFO_0009080 |OMIM:308100 |Ichthyosis, X-linked |x-linked ichthyosis with steryl-sulfatase deficiency |
|MONDO_0010622 |OMIM:308100 |Ichthyosis, X-linked |recessive X-linked ichthyosis |
|MONDO_0013568 |OMIM:614090 |Sick sinus syndrome 3 |null |
|MONDO_0012161 |OMIM:608957 |Immunodeficiency 116 |null |
|MONDO_0011650 |OMIM:606217 |Atrioventricular septal defect 2 |null |
|MONDO_0011652 |OMIM:606232 |Phelan-McDermid syndrome |Phelan-McDermid syndrome |
|Orphanet_48652 |OMIM:606232 |Phelan-McDermid syndrome |Monosomy 22q13 |
|MONDO_0859376 |OMIM:620241 |Hydrocephalus, congenital, 5 |null |
|MONDO_0013957 |OMIM:614893 |Immunodeficiency 32A |null |
|MONDO_0011875 |OMIM:607628 |Juvenile absence epilepsy 2 |null |
|MONDO_0859316 |OMIM:620121 |Iron overload |null |
|MONDO_0012843 |OMIM:612269 |Epilepsy, childhood absence 5 |null |
|MONDO_0008633 |OMIM:191900 |Muckle-Wells syndrome |Muckle-Wells syndrome |
|MONDO_0044315 |OMIM:617439 |Craniosynostosis 7 |null |
|MONDO_0010389 |OMIM:300645 |Immunodeficiency 34 |null |
|MONDO_0011776 |OMIM:607115 |Chronic infantile neurologic cutaneous and articular syndrome|CINCA syndrome |
|MONDO_0008693 |OMIM:200110 |Ablepharon-macrostomia syndrome |ablepharon macrostomia syndrome |
|MONDO_0012670 |OMIM:611451 |Deafness, autosomal recessive, 63 |autosomal recessive nonsyndromic hearing loss 63 |
|MONDO_0009288 |OMIM:232240 |Glycogen storage disease 1C |glycogen storage disease Ib |
|Orphanet_79259 |OMIM:232240 |Glycogen storage disease 1C |Glycogen storage disease due to glucose-6-phosphatase deficiency type b|
|Orphanet_364 |OMIM:232240 |Glycogen storage disease 1C |Glycogen storage disease due to glucose-6-phosphatase deficiency |
|MONDO_0011163 |OMIM:601887 |Malignant hyperthermia 5 |null |
|MONDO_0009288 |OMIM:232220 |Glycogen storage disease 1B |glycogen storage disease Ib |
|Orphanet_364 |OMIM:232220 |Glycogen storage disease 1B |Glycogen storage disease due to glucose-6-phosphatase deficiency |
|Orphanet_79259 |OMIM:232220 |Glycogen storage disease 1B |Glycogen storage disease due to glucose-6-phosphatase deficiency type b|
|MONDO_0011875 |OMIM:607628 |Juvenile myoclonic epilepsy 8 |null |
|MONDO_0008856 |OMIM:209950 |Immunodeficiency 27A |null |
|MONDO_0008853 |OMIM:209885 |Barber-Say syndrome |Barber-Say syndrome |
|MONDO_0013955 |OMIM:614891 |Immunodeficiency 30 |null |
|MONDO_0010576 |OMIM:304400 |Deafness, X-linked, 2 |X-linked mixed hearing loss with perilymphatic gusher |
|Orphanet_383 |OMIM:304400 |Deafness, X-linked, 2 |X-linked mixed deafness with perilymphatic gusher |
|MONDO_0013956 |OMIM:614892 |Immunodeficiency 31A |null |
|MONDO_0030334 |OMIM:619441 |Encephalitis, acute, infection (viral)-induced, 11 |null |
|EFO_0004190 |OMIM:609887 |Glaucoma 1, open angle, G |open-angle glaucoma |
|MONDO_0012141 |OMIM:608864 |Non-syndromic orofacial cleft 6 |null |
|MONDO_0030004 |OMIM:618830 |Autism 20 |null |
|MONDO_0013498 |OMIM:613950 |Schizophrenia 15 |null |
|MONDO_0011159 |OMIM:601868 |Deafness, autosomal dominant, 13 |null |
|MONDO_0009335 |OMIM:235400 |Hemolytic uremic syndrome, atypical, 1 |null |
|MONDO_0007349 |OMIM:120100 |Familial cold autoinflammatory syndrome 1 |familial cold autoinflammatory syndrome 1 |
|Orphanet_47045 |OMIM:120100 |Familial cold autoinflammatory syndrome 1 |Familial cold urticaria |
|MONDO_0014710 |OMIM:616622 |Immunodeficiency 42 |null |
|MONDO_0044206 |OMIM:215150 |Otospondylomegaepiphyseal dysplasia, autosomal recessive |otospondylomegaepiphyseal dysplasia, autosomal recessive |
|MONDO_0007849 |OMIM:148200 |Keratoendothelitis fugax hereditaria |keratitis fugax hereditaria |
+-------------------------+-------------------+-------------------------------------------------------------+-----------------------------------------------------------------------+
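One way to separate the two cases in the table is to look up each mapped ID in the disease index and flag the ones that come back without a name. The sketch below is plain Python using a few rows copied from the table; `disease_index` is a hypothetical stand-in for the EFO-slim-based disease index, and in the actual pipeline this would presumably be a left join in Spark.

```python
# A few (mapped ID, source ID, source label) rows copied from the table above.
mappings = [
    ("MONDO_0011875", "OMIM:607628", "Epilepsy, idiopathic generalized 11"),
    ("MONDO_0008490", "OMIM:184840",
     "Otospondylomegaepiphyseal dysplasia, autosomal dominant"),
    ("MONDO_0013568", "OMIM:614090", "Sick sinus syndrome 3"),
    ("EFO_0004190", "OMIM:609887", "Glaucoma 1, open angle, G"),
]

# Hypothetical stand-in for the disease index: mapped ID -> name.
disease_index = {
    "MONDO_0008490": "otospondylomegaepiphyseal dysplasia, autosomal dominant",
    "EFO_0004190": "open-angle glaucoma",
}


def split_by_index(mappings, index):
    """Left-join mappings to the index; a missing name means the ID is not indexed."""
    found, missing = [], []
    for mapped_id, source_id, label in mappings:
        name = index.get(mapped_id)  # None plays the role of a null name
        target = found if name is not None else missing
        target.append((mapped_id, source_id, label, name))
    return found, missing


found, missing = split_by_index(mappings, disease_index)
for mapped_id, source_id, label, _ in missing:
    print(f"{mapped_id} ({source_id}, {label}) is not in the disease index")
```

Rows that end up in `missing` correspond to the null entries in the name column.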
Altogether, I'm happy with the performance compared to the existing pipeline.
UniProt provides a curated set of naturally occurring, protein-coding variants that are involved in diseases. This dataset is already captured by the uniprot_variants dataset, produced by the parser developed by the UniProt team. However, recent developments in the platform and the upcoming integration of the genetics product made it necessary to reconsider the evidence generation process and the data model.
Consideration
elevated kinase activity; efficiently induces cell transformation
is not parsed; all evidence is annotated with the constant: "targetModulation":"up_or_down"
) this entry may act as a disease modifier
would imply a weaker confidence)
TODOs