monarch-initiative / monarch-app

Monarch Initiative website and API
https://monarchinitiative.org/
BSD 3-Clause "New" or "Revised" License

Generate new Mondo files for rare disease drug repurposing #747

Open mellybelly opened 1 week ago

mellybelly commented 1 week ago

We need to create some Mondo subset files for use in KGs along the following lines:

This will require querying the Monarch KG and/or other KGs, but the idea would be to release these files as part of the Mondo release package moving forward. This ticket is intended as a starting place; feedback welcome.

matentzn commented 1 week ago

These are a lot of different subsets. Before we think about actually making specific files, let's make sure we understand the use cases correctly:

Rare disease subset with leaf nodes representing variants in same gene rolled up to parent term

I am assuming @sabrinatoro can explain exactly what this means when she is back.

Rare disease subset where all diseases known to be related to the same Reactome or KEGG pathway or RHEA reaction are annotated as such

This is a simple KG query and, I think, not suitable as a Mondo subset. It could be delivered as a KG component if it has value when distributed in isolation.
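To make the "simple KG query" concrete, here is a minimal sketch of a two-hop join (disease to gene, gene to pathway) that groups diseases by shared pathway. It assumes the KG edges have already been exported as plain pairs; the function name and the `EX:` identifiers are hypothetical placeholders, not actual Monarch KG content or its API.

```python
from collections import defaultdict


def diseases_by_pathway(disease_gene_edges, gene_pathway_edges):
    """Group diseases by the pathways their associated genes take part in."""
    # Index gene -> pathways so each disease-gene edge is a dictionary lookup.
    pathways_of = defaultdict(set)
    for gene, pathway in gene_pathway_edges:
        pathways_of[gene].add(pathway)

    # Second hop: attach every disease to the pathways of its gene.
    grouped = defaultdict(set)
    for disease, gene in disease_gene_edges:
        for pathway in pathways_of.get(gene, ()):
            grouped[pathway].add(disease)

    # Sort keys and values for stable, diff-friendly output.
    return {pathway: sorted(diseases) for pathway, diseases in sorted(grouped.items())}


# Hypothetical placeholder IDs, not real KG content.
disease_gene = [
    ("EX:disease1", "EX:geneA"),
    ("EX:disease2", "EX:geneA"),
    ("EX:disease3", "EX:geneB"),
]
gene_pathway = [("EX:geneA", "EX:pathwayX"), ("EX:geneB", "EX:pathwayY")]
result = diseases_by_pathway(disease_gene, gene_pathway)
```

In this toy input, `EX:disease1` and `EX:disease2` end up under the same pathway because they share `EX:geneA`. The same shape would apply to RHEA reactions by swapping the second edge list.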

I think this ticket should be moved to the monarch-kg repo, but we can leave it here to shepherd it for the time being.

mellybelly commented 1 week ago

RE first question: leaf nodes that are subclasses of a disease with a single gene linked, and the subclasses are the variants in that same gene.
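One possible reading of this roll-up rule, sketched in Python: a leaf is replaced by its parent when the parent is linked to exactly one gene and the leaf is linked to that same gene. The input dictionaries, the function name, and the `EX:` identifiers are assumptions for illustration only; the actual rule may need refinement once the use case is confirmed.

```python
def roll_up_leaves(parents, genes):
    """Map each leaf term to the term that should represent it in the subset.

    parents: dict of term -> set of direct superclass terms
    genes:   dict of term -> set of gene IDs linked to that term
    """
    # A term is a leaf if no other term lists it as a parent.
    has_children = {p for ps in parents.values() for p in ps}
    all_terms = set(parents) | has_children | set(genes)
    leaves = all_terms - has_children

    rolled = {}
    for leaf in sorted(leaves):
        target = leaf  # default: keep the leaf as-is
        leaf_genes = genes.get(leaf, set())
        for parent in sorted(parents.get(leaf, ())):
            parent_genes = genes.get(parent, set())
            # Roll up only when the parent has a single gene and the leaf shares it.
            if len(parent_genes) == 1 and leaf_genes == parent_genes:
                target = parent
                break
        rolled[leaf] = target
    return rolled


# Hypothetical placeholder IDs, not real Mondo terms.
parents = {
    "EX:variant1": {"EX:disease"},
    "EX:variant2": {"EX:disease"},
    "EX:other": {"EX:multi"},
}
genes = {
    "EX:disease": {"EX:geneA"},
    "EX:variant1": {"EX:geneA"},
    "EX:variant2": {"EX:geneA"},
    "EX:multi": {"EX:geneA", "EX:geneB"},
    "EX:other": {"EX:geneB"},
}
rolled = roll_up_leaves(parents, genes)
```

Here the two variant leaves collapse into `EX:disease`, while `EX:other` stays as-is because its parent is linked to more than one gene.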

As to where to put the ticket: we need the files released as Mondo files, so the task touches both the KG and the Mondo release processes. Happy to move or split the ticket though.

twhetzel commented 1 week ago

Since Mondo only includes gene annotations on diseases when caused by a single gene, will any additional modeling be added into Mondo to include gene annotations for diseases where >1 gene is involved?

mellybelly commented 1 week ago

Since Mondo only includes gene annotations on diseases when caused by a single gene, will any additional modeling be added into Mondo to include gene annotations for diseases where >1 gene is involved?

This is in the Monarch KG, not in Mondo at present. This is why it's confusing where this ticket should go :-).

monicacecilia commented 1 week ago

Yes, this ticket belongs in monarch-app, as it requests querying the KG. I'll transfer it. The outputs of the KG query can be disseminated with the Mondo monthly releases. We should make a separate ticket about it in the Mondo repo. @matentzn, can you please do that?

matentzn commented 1 week ago

I will open an issue once I understand everything better. I have reached out to @mellybelly on Slack to get some context on this ticket. The way I read it, there is no real need for pushing this content out as part of a Mondo subset; I am sure I am just misunderstanding something, and when it becomes clear, I will make sure it's done.

matentzn commented 1 week ago

Context from @sabrinatoro :

pascalwhoop commented 5 days ago

Hey all, maybe this helps?

In terms of deliverables we are looking for 2 things:

  1. 2 datasets for "target drugs & target diseases"
  2. the code to generate them based on e.g. the Mondo KG

The dataset doesn't need to be perfect if we can iterate on the code because we can then re-execute as often as we want. For our codebase we're setting a few standards which we noted down in our Tech tools page:

This way the code stays portable, and Kedro's catalog then lets us define whether to store the datasets in cloud buckets, locally, or in a SQL/graph DB. Example Kedro code is here https://github.com/everycure-org/matrix/tree/main/pipelines/matrix/src/matrix/pipelines/integration and the catalog is here https://github.com/everycure-org/matrix/blob/main/pipelines/matrix/conf/base/integration/catalog.yml

In terms of data format, we usually go for Parquet since it compresses well and is easily picked up by BigQuery, which makes the data more accessible to analysts as we iterate.

For columns, the MVP is the IDs used in RTX-KG2 (MONDO:...), but of course it would be good to add some metadata to that, e.g. common name and description, whatever is easy to do and not too wasteful.
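As a rough illustration of that column layout, here is a minimal sketch of a pure function that turns raw records into rows with `id`, `name`, and `description`. It is written as a plain Python function so it could be wrapped as a Kedro node, with the catalog deciding where and in what format (e.g. Parquet) the result is stored; that wiring is not shown. The function name, the 7-digit CURIE check, and the sample records are assumptions for illustration, not part of any existing codebase.

```python
import re

# Assumption: Mondo CURIEs look like "MONDO:" followed by 7 digits.
MONDO_CURIE = re.compile(r"^MONDO:\d{7}$")


def build_disease_table(records):
    """Return (rows, rejected): rows use the MVP columns, rejected holds bad inputs."""
    rows, rejected = [], []
    for record in records:
        curie = str(record.get("id", "")).strip()
        if not MONDO_CURIE.match(curie):
            # Keep malformed records aside instead of silently dropping them.
            rejected.append(record)
            continue
        rows.append(
            {
                "id": curie,
                "name": record.get("name"),  # optional metadata
                "description": record.get("description"),  # optional metadata
            }
        )
    # Sort by ID so re-running the code gives a reproducible file.
    rows.sort(key=lambda row: row["id"])
    return rows, rejected


# Hypothetical placeholder records, not real Mondo terms.
records = [
    {"id": "MONDO:9999992", "name": "example disease B"},
    {"id": "MONDO:9999991", "name": "example disease A", "description": "placeholder"},
    {"id": "mondo_123"},
]
rows, rejected = build_disease_table(records)
```

Keeping the function free of I/O means the dataset can be regenerated as often as needed, and the storage backend can change without touching the logic.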