Generate new Mondo files for rare disease drug repurposing

mellybelly commented 5 months ago

We need to create some Mondo subset files for use in KGs along the following lines:

- Rare disease subset
- Rare disease subset with leaf nodes representing variants in same gene rolled up to parent term
- All genetic diseases
- Rare disease subset where all diseases known to be related to the same Reactome or KEGG pathway or RHEA reaction are annotated as such
- All diseases without known treatments (stretch goal)

This will require querying the Monarch KG and/or other KGs, but the idea would be to release these files as part of the Mondo release package moving forward. This ticket is intended as a starting place; feedback welcome.

matentzn commented 5 months ago

These are a lot of different subsets. Before we think about actually making specific files, let's make sure we understand the use cases correctly:

Rare disease subset with leaf nodes representing variants in same gene rolled up to parent term

I am assuming @sabrinatoro can explain what this exactly means when she is back.

leaf nodes representing variants (you mean, diseases that are variant-specific?)
You want to propagate variant associations from a child disease up to its parent? That is an annotation roll-up and does not belong in an ontology (due to its all-some semantics). It seems you are asking for a KG extract here?

Rare disease subset where all diseases known to be related to the same Reactome or KEGG pathway or RHEA reaction are annotated as such

Simple KG query, not suitable as a Mondo subset I think - this could be delivered as a KG component if it has value to be distributed in isolation.

I think this ticket should be moved to the monarch-kg repo, but we can leave it here to shepherd it for the time being.

mellybelly commented 5 months ago

Re the first question: these are leaf nodes that are subclasses of a disease with a single linked gene, where the subclasses are the variants in that same gene.
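To make the roll-up concrete, here is a minimal sketch in plain Python. It assumes a simplified single-parent hierarchy (Mondo terms can have several parents) and gene links supplied as a dict, which in practice would come from the KG. The function name and the `EX:`/`GENE:` identifiers are hypothetical.

```python
from collections import defaultdict


def roll_up_variant_leaves(parent_of, genes_of):
    """Return a dict mapping each term to the term that represents it after roll-up.

    parent_of: dict child_id -> parent_id (assumption: one parent per term)
    genes_of:  dict term_id -> set of linked gene ids

    A leaf is rolled up only when its parent is linked to exactly one gene
    and the leaf is linked to that same single gene.
    """
    children = defaultdict(set)
    for child, parent in parent_of.items():
        children[parent].add(child)

    all_terms = set(parent_of) | set(parent_of.values()) | set(genes_of)
    rolled = {}
    for term in all_terms:
        parent = parent_of.get(term)
        is_leaf = term not in children
        parent_genes = genes_of.get(parent, set())
        same_single_gene = (
            len(parent_genes) == 1 and genes_of.get(term, set()) == parent_genes
        )
        if is_leaf and parent is not None and same_single_gene:
            rolled[term] = parent
        else:
            rolled[term] = term
    return rolled


# Toy data, not real Mondo content.
parent_of = {
    "EX:variant1": "EX:single_gene_disease",
    "EX:variant2": "EX:single_gene_disease",
    "EX:other_leaf": "EX:multi_gene_disease",
}
genes_of = {
    "EX:single_gene_disease": {"GENE:X"},
    "EX:variant1": {"GENE:X"},
    "EX:variant2": {"GENE:X"},
    "EX:multi_gene_disease": {"GENE:Y", "GENE:Z"},
    "EX:other_leaf": {"GENE:Y"},
}
rolled = roll_up_variant_leaves(parent_of, genes_of)
```

In this toy input the two variant leaves map to `EX:single_gene_disease`, while `EX:other_leaf` stays as it is because its parent has more than one linked gene.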

As to where to put the ticket: we need the files released as Mondo files, so the task involves both the KG and the Mondo release processes. Happy to move or split the ticket though.

twhetzel commented 5 months ago

Since Mondo only includes gene annotations on diseases when caused by a single gene, will any additional modeling be added into Mondo to include gene annotations for diseases where >1 gene is involved?

mellybelly commented 5 months ago

Since Mondo only includes gene annotations on diseases when caused by a single gene, will any additional modeling be added into Mondo to include gene annotations for diseases where >1 gene is involved?

This is in the Monarch KG, not in Mondo at present. This is why it's confusing where this ticket should live :-).

monicacecilia commented 5 months ago

Yes, this ticket belongs in monarch-app, as it requests querying the KG. I'll transfer it. The outputs of the KG query can be disseminated with the Mondo monthly releases. We should make a separate ticket about it in the Mondo repo. @matentzn, can you please do that?

matentzn commented 5 months ago

I will open an issue once I understand everything better - I have reached out to @mellybelly on Slack to get some context on this ticket. The way I read it, there is no real need for pushing this content out as part of a Mondo subset; I am sure I am just misunderstanding something, and when it becomes clear, I will make sure it's done.

matentzn commented 5 months ago

Context from @sabrinatoro:

MATRIX project (drug repurposing)
Monarch-ROBOKOP reconciliation
We want to start from a list of diseases with all of their associations:
Mondo subset for all diseases that are defined by a pathway
Monarch KG annotation file with all associations for these
Tricky thing: what kind of subset is useful?
Interesting
- Diseases that share the same gene
- Diseases that share the same pathway (candidate for repurposing drugs)
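As a rough illustration of the two "interesting" groupings, the sketch below groups diseases by a shared annotation (a gene or a pathway). It assumes the associations have already been pulled from the KG as `(disease_id, annotation_id)` pairs; the function name and the identifiers in the toy data are hypothetical.

```python
from collections import defaultdict


def group_by_shared_annotation(associations):
    """Group disease ids by annotation id, keeping only annotations
    (genes or pathways) shared by at least two diseases."""
    diseases_by_annotation = defaultdict(set)
    for disease_id, annotation_id in associations:
        diseases_by_annotation[annotation_id].add(disease_id)
    return {
        annotation_id: sorted(disease_ids)
        for annotation_id, disease_ids in diseases_by_annotation.items()
        if len(disease_ids) > 1
    }


# Toy disease-to-pathway pairs, not real KG content.
pairs = [
    ("EX:disease1", "PATHWAY:A"),
    ("EX:disease2", "PATHWAY:A"),
    ("EX:disease3", "PATHWAY:B"),
    ("EX:disease1", "PATHWAY:A"),  # duplicate rows are harmless
]
shared = group_by_shared_annotation(pairs)
```

The same function works for disease-to-gene pairs, so one pass over each association type would give both candidate groupings.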

pascalwhoop commented 5 months ago

Hey all, maybe this helps?

In terms of deliverables we are looking for 2 things:

2 datasets for "target drugs & target diseases"
the code to generate them based on e.g. the mondo KG

The dataset doesn't need to be perfect if we can iterate on the code because we can then re-execute as often as we want. For our codebase we're setting a few standards which we noted down in our Tech tools page:

have the ingestion be a pure Python function using e.g. requests and returning a pandas DataFrame of the upstream dataset. This can also be a subprocess call to a CLI tool, but that's less clean.
have secondary Python functions process this DataFrame to produce the "drugs" and "diseases" datasets respectively
integrate those into our kedro pipeline (code)
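The first two standards above could look roughly like the sketch below. It is only an illustration: the TSV layout, the column names `subject` and `subject_label`, and all function names are assumptions, not the actual Monarch KG export format or our pipeline code. The fetch step is injectable so the processing can be exercised without network access.

```python
import io

import pandas as pd


def fetch_text(url: str) -> str:
    """Download a text resource with requests (imported lazily)."""
    import requests

    response = requests.get(url, timeout=60)
    response.raise_for_status()
    return response.text


def ingest_associations(url: str, fetch=fetch_text) -> pd.DataFrame:
    """Ingestion step: return the upstream TSV as a DataFrame of strings."""
    return pd.read_csv(io.StringIO(fetch(url)), sep="\t", dtype=str)


def make_diseases_dataset(raw: pd.DataFrame) -> pd.DataFrame:
    """Processing step: one row per Mondo disease with its id and name.

    Assumes hypothetical columns 'subject' and 'subject_label'.
    """
    is_mondo = raw["subject"].str.startswith("MONDO:", na=False)
    return (
        raw.loc[is_mondo, ["subject", "subject_label"]]
        .rename(columns={"subject": "id", "subject_label": "name"})
        .drop_duplicates()
        .sort_values("id")
        .reset_index(drop=True)
    )


# Toy upstream content standing in for a real download.
SAMPLE_TSV = (
    "subject\tsubject_label\tobject\n"
    "MONDO:0000002\texample disease B\tGENE:2\n"
    "MONDO:0000001\texample disease A\tGENE:1\n"
    "MONDO:0000001\texample disease A\tGENE:3\n"
    "HP:0000001\tnot a disease\tGENE:4\n"
)
raw = ingest_associations("https://example.org/associations.tsv", fetch=lambda url: SAMPLE_TSV)
diseases = make_diseases_dataset(raw)
```

A "drugs" function would follow the same shape, and both would then be wired into the pipeline as separate nodes.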

This way the code stays portable, and Kedro's catalog then allows us to define whether to store the datasets in cloud buckets, locally, or in a SQL/graph DB. Example Kedro code is here: https://github.com/everycure-org/matrix/tree/main/pipelines/matrix/src/matrix/pipelines/integration and the catalog is here: https://github.com/everycure-org/matrix/blob/main/pipelines/matrix/conf/base/integration/catalog.yml

In terms of data format, we usually go for Parquet since it's well compressed and easily picked up by BigQuery, and thus we can make the data more accessible to analysts as we iterate.

For columns, the MVP is the IDs used in RTX-KG2 (MONDO:...), but of course it would be good to add some metadata to that, e.g. common name, description, etc.: whatever is easy to do and not too wasteful.
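A possible shape for one row of that MVP, sketched with the standard library only. The field names and the helper are hypothetical, and the regular expression assumes Mondo CURIEs are `MONDO:` followed by seven digits, which should be checked against the IDs actually present in RTX-KG2.

```python
import re
from typing import NamedTuple, Optional

# Assumption: Mondo CURIEs look like "MONDO:" plus seven digits.
MONDO_CURIE = re.compile(r"^MONDO:\d{7}$")


class DiseaseRow(NamedTuple):
    id: str                            # required: the MVP column
    name: Optional[str] = None         # optional metadata
    description: Optional[str] = None  # optional metadata


def make_disease_row(id, name=None, description=None):
    """Build a row, rejecting ids that do not look like Mondo CURIEs."""
    if not MONDO_CURIE.match(id):
        raise ValueError(f"not a Mondo CURIE: {id!r}")
    return DiseaseRow(id=id, name=name, description=description)


row = make_disease_row("MONDO:0000001", name="example disease")
```

Keeping the metadata fields optional means the ID-only MVP and a richer later version can share one schema.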

monarch-initiative / monarch-app

Generate new Mondo files for rare disease drug repurposing #747