neurogenomics / RareDiseasePrioritisation

Prioritise cell-type-specific gene targets from the Rare Disease Celltyping project.

Assess existence of experimental models #33

Closed bschilder closed 8 months ago

bschilder commented 11 months ago

Assess whether there is an existing experimental model for each candidate therapeutic target.

We can check this by seeing if there is an MPO or UPHENO annotation for the same phenotype.

bschilder commented 11 months ago

Found a treasure trove of data on experimental models for diseases (and perhaps specific phenotypes) on Monarch: https://data.monarchinitiative.org/latest/tsv/model_associations/

These files don't include gene-level info (which we would want if we have a particular gene therapy target in mind), but I'm checking whether I can extract that from the larger Monarch knowledge graph: https://data.monarchinitiative.org/monarch-kg/latest/

They also only provide MONDO IDs for each disease, so I need to find an effective way to map these back to the HPO/OMIM/DECIPHER/ORPH IDs provided by HPO. I've reached out to the MONDO ontology creators as well:

NathanSkene commented 11 months ago

What's the argument against just using Mammalian Phenotype Ontology overlap?

Also, here's some of the messages we sent relating to this previously:

Here's one of the genes that has a mouse model for respiratory failure: http://www.informatics.jax.org/reference/J:120296

Here's the list of Mammalian Phenotype Ontology genes (for respiratory failure): http://www.informatics.jax.org/mp/annotations/MP:0001953

Gene therapy for ABCA3 in respiratory failure is already being looked into: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8798122/

bschilder commented 11 months ago

What's the argument against just using Mammalian Phenotype Ontology overlap?

Several reasons:

NathanSkene commented 11 months ago

Sounds good!

bschilder commented 11 months ago

Preliminary summary plot showing the proportion of orthologous genes overlapping between HPO and non-human ontology databases (within a given phenotype), repeated across many phenotypes:

Image

Will include this in the final report as well as showing how we can use this to prioritise gene/phenotype-specific therapeutic targets.

NathanSkene commented 11 months ago

Great, didn’t think about looking at zebrafish models etc as well!

Can you explain the x-axis?

bschilder commented 11 months ago

Great, didn’t think about looking at zebrafish models etc as well! Can you explain the x-axis?

Sure!

Dividing one over the other thus gives you the proportion of HPO gene annotations recapitulated in the equivalent phenotype of another species.

This proportion will be influenced by both evolutionary distance and how well studied each species is (notice the difference between mice and rats, despite the fact that they're equally related to humans).
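As a rough illustration of that ratio, here is a minimal base-R sketch. The helper name `prop_recapitulated` and the toy gene sets are hypothetical, and it assumes the numerator is the number of HPO genes for a phenotype that are also annotated (via orthologs) to the matched phenotype in the other species, with the denominator being all HPO genes for that phenotype. The actual plotting code may define these differently.

```r
# Hypothetical helper: proportion of a phenotype's HPO genes that are also
# annotated to the matched phenotype in another species' ontology.
# Both inputs are assumed to be character vectors of human gene symbols,
# with the non-human genes already converted to their human orthologs.
prop_recapitulated <- function(hpo_genes, model_genes) {
  hpo_genes <- unique(hpo_genes)
  if (length(hpo_genes) == 0) return(NA_real_)
  length(intersect(hpo_genes, model_genes)) / length(hpo_genes)
}

# Toy gene sets (placeholder names, for illustration only).
hpo_genes   <- c("GENE_A", "GENE_B", "GENE_C", "GENE_D")
mouse_genes <- c("GENE_A", "GENE_B", "GENE_X")

# Two of the four HPO genes are recapitulated in the toy mouse set.
prop_recapitulated(hpo_genes, mouse_genes)
```

Running this once per phenotype and per species would give the kind of distribution summarised in the plot.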

bschilder commented 11 months ago

Here are some gene therapy target phenotypes identified by our previous analyses. The exact phenotypes will likely change once we add ChatGPT annotations to our filtering strategy with the round of enrichment results, but for now these can serve as an example.

The heatmap is colored by the "equivalence score", which is essentially UPHENO's way of quantifying how well a phenotype matches up across species (on a scale from 0-1). Data comes from here.

Currently the fuzzy equivalence score is the Jaccard similarity:

Not sure exactly on what basis they computed Jaccard similarity, but I'll look into this some more.

upheno_top_targets_heatmap.pdf

Looks like UPHENO has been thinking about adding fly ontology mappings as well, though there hasn't been any activity on this since 2016 it seems. Just pinged them to get an update:

bschilder commented 11 months ago

Currently the fuzzy equivalence score is the Jaccard similarity. Not sure exactly on what basis they computed Jaccard similarity, but I'll look into this some more.

See this HPO publication, in which they did the mapping with Exomiser:

For example, Exomiser (15) leverages the semantic associations between HPO, MP and ZP to prioritize variants effectively by matching human phenotypic abnormalities with phenotypes observed in animal models with knockouts of genes orthologous to human disease-associated genes.

This figure suggests there are already fly and frog mappings as well, though. I'll reach out to the HPO team to confirm where I might find these, and to confirm the methodology they used to do the phenotype mapping:

Image

matentzn commented 11 months ago

@bschilder would you be up for a quick call on the matter? I will sort you out with fuzzy and proper matches as well.

bschilder commented 11 months ago

@bschilder would you be up for a quick call on the matter? I will sort you out with fuzzy and proper matches as well.

Absolutely! Thank you so much for reaching out! Setting up a time for us to meet.

bschilder commented 11 months ago

Met with @matentzn who was extremely helpful in explaining the cross-species phenotype matching procedure to me, and pointing me to some additional resources.

For mapping MONDO IDs in the Monarch models file, I'm switching to using this file as it avoids the issues observed here:

With these changes, HPOExplorer can now map >90% of the MONDO IDs listed in the models file to OMIM IDs:

> library(HPOExplorer)
> model <- get_monarch("disease_to_model")
 [100%] Downloaded 883280 bytes...
> model$db <- stringr::str_split(model$subject, ":", simplify = TRUE)[,1]
> model <- map_mondo(dat = model,
+                    input_col = "object",
+                    output_col = "OMIM_ID",
+                    to = c("OMIM", "Orphanet"))
 [100%] Downloaded 1082741 bytes...
476 / 5,154 (9.24%) OMIM_ID missing.

The only issue is that, as far as I can tell, MONDO doesn't seem to contain any mappings between MONDO IDs and DECIPHER IDs. DECIPHER IDs only make up a small fraction of the HPO annotations, but it would be nice to have a complete mapping nonetheless:

> phenos <- make_phenos_dataframe(add_disease_data = TRUE)
> phenos$disease_db <- stringr::str_split(phenos$disease_id, ":", simplify = TRUE)[,1]
> table(phenos$disease_db)

Screenshot 2023-12-08 at 23 12 22

bschilder commented 11 months ago

To summarise, the phenotype matching procedure is meant to capture semantic similarity using a semi-heuristic model (a combination of explicit rules and data-driven methods). Data inputs come from a variety of sources. Ultimately, they link together concepts (species, diseases, phenotypes, genes, pathways, etc.) in a knowledge graph derived from a mix of NLP queries to the published literature and other databases.

@matentzn this is probably a poor attempt to explain this properly, but if there's a paper or docs page you could point me to that would be quite helpful! Thanks!

matentzn commented 11 months ago

DECIPHER

We have this for DECIPHER: https://github.com/monarch-initiative/mondo/blob/master/src/ontology/mappings/mondo_hasdbxref_decipher.sssom.tsv

Which will do the job for you!
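For anyone following along, here is a minimal base-R sketch of turning an SSSOM TSV mapping file into a MONDO-to-DECIPHER lookup. It assumes the usual SSSOM layout ("#"-prefixed metadata lines followed by a tab-separated table with `subject_id`, `predicate_id` and `object_id` columns) and that MONDO IDs sit in `subject_id`; the inline text and IDs below are placeholders rather than real mappings, so check the actual file's columns before relying on this.

```r
# A tiny stand-in for an SSSOM TSV file: "#"-prefixed metadata lines,
# then a tab-separated table. The IDs below are placeholders, not real mappings.
sssom_txt <- paste(
  "# mapping_set_id: example",
  "subject_id\tpredicate_id\tobject_id",
  "MONDO:0000001\toio:hasDbXref\tDECIPHER:1",
  "MONDO:0000002\toio:hasDbXref\tDECIPHER:2",
  sep = "\n"
)

# comment.char = "#" skips the metadata header lines.
sssom <- read.delim(text = sssom_txt, comment.char = "#",
                    stringsAsFactors = FALSE)

# Named lookup vector: MONDO ID -> DECIPHER ID.
mondo_to_decipher <- setNames(sssom$object_id, sssom$subject_id)

# Map a vector of MONDO IDs; unmapped IDs come back as NA.
ids <- c("MONDO:0000002", "MONDO:0000003")
mapped <- unname(mondo_to_decipher[ids])
mapped
```

To use it on the real file, `text = sssom_txt` would be swapped for the path to a local copy of the TSV.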

To summarise, the phenotype matching procedure is meant to capture semantic similarity using a semi-heuristic model (a combination of explicit rules and data-driven methods). Data inputs come from a variety of sources. Ultimately, they link together concepts (species, diseases, phenotypes, genes, pathways, etc.) in a knowledge graph derived from a mix of NLP queries to the published literature and other databases.

It's simpler than that.

  1. We generate phenotypic profiles from ontologies, usually using Jaccard similarity over the hierarchical relations in the ontology, and information content for the reranking
  2. Cool paper with background: https://www.osti.gov/biblio/1625303
  3. The current "bestmatches" include a mix of logical and simple lexical matches and are hugely out of date (I would not use them in production, but they are probably "not wrong")
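A minimal base-R sketch of the Jaccard part of point 1, assuming similarity is computed between the sets of ancestors (including the term itself) of two terms in an is-a hierarchy. The toy hierarchy and function names are hypothetical, the information-content reranking is not covered, and this is not the Monarch implementation.

```r
# Toy is-a hierarchy (child -> parents); term names are hypothetical.
parents <- list(
  root = character(0),
  A    = "root",
  B    = "root",
  A1   = "A",
  A2   = "A",
  B1   = "B"
)

# All ancestors of a term, including the term itself.
ancestors <- function(term, parents) {
  seen  <- term
  queue <- parents[[term]]
  while (length(queue) > 0) {
    t     <- queue[1]
    queue <- queue[-1]
    if (!t %in% seen) {
      seen  <- c(seen, t)
      queue <- c(queue, parents[[t]])
    }
  }
  seen
}

# Jaccard similarity between the ancestor sets of two terms:
# |intersection| / |union|.
jaccard <- function(a, b, parents) {
  x <- ancestors(a, parents)
  y <- ancestors(b, parents)
  length(intersect(x, y)) / length(union(x, y))
}

jaccard("A1", "A2", parents)  # siblings: share "A" and "root"
jaccard("A1", "B1", parents)  # different branches: share only "root"
```

Terms that share more of their ancestry score closer to 1, which is why siblings score higher than terms on different branches.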

I requested an FBcv profile for you here: https://github.com/monarch-initiative/monarch-semantic-similarity-profiles/issues/16

So you can take a look at what it looks like.

bschilder commented 11 months ago

DECIPHER

We have this for DECIPHER: https://github.com/monarch-initiative/mondo/blob/master/src/ontology/mappings/mondo_hasdbxref_decipher.sssom.tsv

Which will do the job for you!

Ah, amazing! I had totally missed that because I was using this file, which I assumed included all the other ones: https://github.com/monarch-initiative/mondo/blob/master/src/ontology/mappings/mondo.sssom.tsv

I've implemented many of these functions within a new package for accessing/processing knowledge graphs in general (HPOExplorer was getting too bloated): https://github.com/neurogenomics/KGExplorer/blob/29eccbbd33fd18d9ce85b0ae72b47d485d97faee/R/map_upheno_data_i.R

I was also just alerted to the monarchr package, which may extract much of the info I need more efficiently than my current approach (which relies mostly on TSV downloads).

I've also begun exploring some of the graph query resources/tools you alerted me to on our call:

To summarise, the phenotype matching procedure is meant to capture semantic similarity using a semi-heuristic model (a combination of explicit rules and data-driven methods). Data inputs come from a variety of sources. Ultimately, they link together concepts (species, diseases, phenotypes, genes, pathways, etc.) in a knowledge graph derived from a mix of NLP queries to the published literature and other databases.

It's simpler than that.

  1. We generate phenotypic profiles from ontologies, usually using Jaccard similarity over the hierarchical relations in the ontology, and information content for the reranking
  2. Cool paper with background: https://www.osti.gov/biblio/1625303
  3. The current "bestmatches" include a mix of logical and simple lexical matches and are hugely out of date (I would not use them in production, but they are probably "not wrong")

Ahhh, this makes so much more sense now! Thanks for explaining that in more detail, and for the paper (super interesting work!). Along those lines, I've found the rphenoscape package useful for computing cross-ontology similarity matrices on the go.

I requested an FBcv profile for you here: monarch-initiative/monarch-semantic-similarity-profiles#16

So you can take a look at what it looks like.

Thank you so much! I really appreciate this, and all your other help.