Open mellybelly opened 5 months ago
These are a lot of different subsets. Before we think about actually making specific files, let's make sure we understand the use cases correctly:
Rare disease subset with leaf nodes representing variants in same gene rolled up to parent term
I am assuming @sabrinatoro can explain what this exactly means when she is back.
Rare disease subset where all diseases known to be related to the same Reactome or KEGG pathway or RHEA reaction are annotated as such
Simple KG query, not suitable as a Mondo subset I think - this could be delivered as a KG component if it has value to be distributed in isolation.
I think this ticket should be moved to the monarch-kg repo, but we can leave it here to shepherd it for the time being.
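To make the "simple KG query" for the pathway use case concrete, here is a minimal sketch in plain Python. It assumes the relation is resolved by joining disease-to-gene edges with gene-to-pathway edges; that join, the function name `diseases_by_pathway`, and all `EX:` identifiers are hypothetical placeholders, not the Monarch KG schema or real records.

```python
from collections import defaultdict

# Hypothetical toy edges; the EX: identifiers are placeholders, not real KG records.
DISEASE_GENE = [
    ("EX:disease_1", "EX:gene_x"),
    ("EX:disease_2", "EX:gene_y"),
    ("EX:disease_3", "EX:gene_x"),
]
GENE_PATHWAY = [
    ("EX:gene_x", "EX:pathway_p"),
    ("EX:gene_y", "EX:pathway_p"),
    ("EX:gene_y", "EX:pathway_q"),
]


def diseases_by_pathway(disease_gene, gene_pathway):
    """Group diseases under every pathway reachable through an associated gene."""
    # Index pathways by gene so each disease-gene edge is a cheap lookup.
    pathways_of_gene = defaultdict(set)
    for gene, pathway in gene_pathway:
        pathways_of_gene[gene].add(pathway)

    grouped = defaultdict(set)
    for disease, gene in disease_gene:
        for pathway in pathways_of_gene.get(gene, ()):
            grouped[pathway].add(disease)

    # Sort for stable, diff-friendly output.
    return {pathway: sorted(diseases) for pathway, diseases in grouped.items()}


result = diseases_by_pathway(DISEASE_GENE, GENE_PATHWAY)
```

With the toy edges above, `EX:pathway_p` collects all three diseases and `EX:pathway_q` only `EX:disease_2`. The same shape would apply to RHEA reactions if the KG exposes a comparable gene-to-reaction edge, which is also an assumption here.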
RE first question: leaf nodes that are subclasses of a disease with a single gene linked, and the subclasses are the variants in that same gene.
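One possible reading of this roll-up rule, sketched in plain Python: a leaf term is replaced by its parent when the parent has exactly one linked gene and the leaf is linked to that same gene. The data, the `roll_up` name, the single-parent simplification, and all `EX:` identifiers are hypothetical; the actual rule still needs confirming in this thread.

```python
# Hypothetical toy data; EX: identifiers are placeholders, not Mondo or HGNC records.
# Simplification: each term has at most one parent (real ontologies allow several).
PARENT_OF = {
    "EX:variant_a": "EX:disease_1",
    "EX:variant_b": "EX:disease_1",
    "EX:variant_c": "EX:disease_2",
    "EX:disease_1": "EX:root",
    "EX:disease_2": "EX:root",
}
GENES_OF = {
    "EX:variant_a": {"EX:gene_x"},
    "EX:variant_b": {"EX:gene_x"},
    "EX:disease_1": {"EX:gene_x"},
    "EX:variant_c": {"EX:gene_y"},
    "EX:disease_2": {"EX:gene_y", "EX:gene_z"},
}


def roll_up(parent_of, genes_of):
    """Map each term to the term that would represent it in the subset."""
    has_children = set(parent_of.values())
    mapping = {}
    for term in set(parent_of) | has_children:
        parent = parent_of.get(term)
        is_leaf = term not in has_children
        parent_genes = genes_of.get(parent, set())
        # Roll up only when the parent has a single gene and the leaf shares it.
        same_single_gene = (
            len(parent_genes) == 1 and genes_of.get(term) == parent_genes
        )
        mapping[term] = parent if (is_leaf and same_single_gene) else term
    return mapping


mapping = roll_up(PARENT_OF, GENES_OF)
```

Here `EX:variant_a` and `EX:variant_b` roll up to `EX:disease_1`, while `EX:variant_c` stays as-is because its parent is linked to two genes.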
As to where to put the ticket: we need the files released as Mondo files, so the task involves both the KG and the Mondo release processes. Happy to move or split the ticket, though.
Since Mondo only includes gene annotations on diseases when caused by a single gene, will any additional modeling be added into Mondo to include gene annotations for diseases where >1 gene is involved?
This is in the Monarch KG, not in Mondo at present. This is why it's confusing where this ticket should be :-).
Yes, this ticket belongs in monarch-app, as it requests querying the KG. I'll transfer it. The outputs of the KG query can be disseminated with the Mondo monthly releases. We should make a separate ticket about it in the Mondo repo. @matentzn, can you please do that?
I will open an issue once I understand everything better - I have reached out to @mellybelly on Slack to get some context on this ticket. The way I read it, there is no real need for pushing this content out as part of a Mondo subset; I am sure I am just misunderstanding something, and when it becomes clear, I will make sure it's done.
Context from @sabrinatoro :
Hey all, maybe this helps?
In terms of deliverables we are looking for 2 things:
The dataset doesn't need to be perfect if we can iterate on the code because we can then re-execute as often as we want. For our codebase we're setting a few standards which we noted down in our Tech tools page:
This way we can have the code be portable and kedro's catalog then allows us to define whether to store the datasets in cloud buckets, locally or in a SQL/graph DB. An example kedro code is here https://github.com/everycure-org/matrix/tree/main/pipelines/matrix/src/matrix/pipelines/integration and the catalog is here https://github.com/everycure-org/matrix/blob/main/pipelines/matrix/conf/base/integration/catalog.yml
In terms of data format, we usually go for parquet since it's well compressed and easily picked up by BigQuery, so we can make the data more accessible to analysts as we iterate.
For columns, MVP is the IDs used in RTX-KG2 (MONDO:...) but of course it would be good to add some metadata to that, e.g. common name, description etc. whatever is easy to do and not too wasteful.
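As a rough illustration of that column layout, the sketch below builds rows with an ID plus optional metadata and writes them to parquet. The column names `id`, `name`, `description`, the helper names, and the `EX:` identifier are assumptions for illustration, not an agreed schema; `write_parquet` additionally assumes pandas and a parquet engine such as pyarrow are installed.

```python
COLUMNS = ["id", "name", "description"]  # assumed column names, not an agreed schema


def build_rows(terms):
    """Normalize term dicts to the MVP columns; missing metadata becomes None."""
    return [
        {
            "id": term["id"],  # the only required field
            "name": term.get("name"),
            "description": term.get("description"),
        }
        for term in terms
    ]


def write_parquet(rows, path):
    """Write rows to a parquet file (requires pandas plus pyarrow or fastparquet)."""
    import pandas as pd  # imported lazily so build_rows works without pandas

    pd.DataFrame(rows, columns=COLUMNS).to_parquet(path, index=False)


# Placeholder term; EX:0001 is not a real Mondo identifier.
rows = build_rows([{"id": "EX:0001", "name": "example disease"}])
```

Keeping the ID as the only required field means the file stays usable as an MVP even when names or descriptions are not available for every term.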
We need to create some mondo subset files for use in KGs along the following lines:
This will require querying the Monarch KG and/or other KGs, but the idea would be to release these files as part of the Mondo release package moving forward. This ticket is intended as a starting place; feedback welcome.