Open RichardBruskiewich opened 1 year ago
@pnrobinson, @cmungall, @putmantime, @kevinschaper @matentzn ... I've 'assigned' you to this issue for the moment, simply to flag the issue for your kind feedback and augmentation.
I am otherwise initiating the review of the Phenol code (Peter, as I have questions about the code base, I'll coordinate with you and Daniel for guidance).
One ancient related issue (in the icebox): https://github.com/monarch-initiative/monarch-ingest/issues/251
Closely related to https://github.com/monarch-initiative/omim/issues/80
@RichardBruskiewich @matentzn offered to give you an overview of the Exomiser/Koza/Mondo situation regarding g2d. We'd like to have a data call after this review process is complete to come up with and schedule the work for a generalized solution.
@matentzn and @putmantime, thank you for the meeting on the 16th March 2023, to discuss this task and formulate a plan for its resolution. Briefly:
Study and document all the ways that OMIM and Orphanet are being processed within the various code bases hosted by Monarch, to guide the creation of a more comprehensive Koza ingest of a more normalized set of Gene-to-Disease (G2D) and Phenotype-to-Disease (P2D) subject-predicate-object associations for the Monarch Graph.
Goal: The Monarch team is attempting to document all the processes that capture G2D and P2D associations (specifically, from OMIM and Orphanet data) across Monarch, in order to identify how this is currently being done, to clarify the provenance of knowledge for easier comparative analyses, and to create a comprehensive G2D and P2D ingest for Monarch.
To meet this goal, an inventory of existing Monarch-hosted (or used) project 'solutions' that have some component of parsing OMIM and Orphanet information into G2D and P2D subject-predicate-object associations will be reviewed. A tentative list of such 'solutions' is already compiled in the task plan (although more may be added if necessary) with identified "application experts" listed alongside. This list currently includes the following Monarch-affiliated applications: Exomiser, Phenol, HPOQC, MONDO OMIM ingest, Dipper and Koza itself.
We will conduct a basic self-study of each 'solution' code base, with the aim of composing a basic architecture and data flow diagram, with brief supporting notes, to serve as a conversation piece with the "application experts" guiding the capture of suitable descriptions of each application with respect to the objective of capturing G2D and P2D associations.
A common interview script of questions is formulated to be posed to each such "application expert" to drive the compilation of software and data characteristics of each application, and includes a request for (sample) 'dumps' of files containing data relating to G2D and P2D associations. An approximately 1 hour interview based on the script will be scheduled and convened with each identified application expert, to correct/refine the aforementioned application architecture and data flow diagram and document additional information relevant to the task goal.
The resolution of this issue will be the documented answers to the aforementioned questions, the data dumps requested, and a first-order comparison of these applications and their data dumps against one another, to guide future Monarch G2D and P2D association Koza ingest design and implementation. These deliverables will be hosted in a secure Monarch private storage bucket for further Monarch team assessment.
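As a rough illustration of what the "first-order comparison" of data dumps could look like, here is a minimal Python sketch. It assumes each application's dump can be exported as a TSV with `subject`, `predicate` and `object` columns; that layout, the function names and the identifiers in the sample data are all hypothetical and not taken from any of the applications under review.

```python
import csv
import io


def load_triples(handle):
    """Read (subject, predicate, object) triples from a TSV file handle.

    Assumes a header row with 'subject', 'predicate' and 'object' columns
    (a hypothetical layout, not an existing Monarch format).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {(row["subject"], row["predicate"], row["object"]) for row in reader}


def predicates_by_pair(triples):
    """Group the predicates asserted for each (subject, object) pair."""
    by_pair = {}
    for subject, predicate, obj in triples:
        by_pair.setdefault((subject, obj), set()).add(predicate)
    return by_pair


def compare_dumps(a, b):
    """Compare two triple sets: overlap, differences, and predicate conflicts."""
    preds_a, preds_b = predicates_by_pair(a), predicates_by_pair(b)
    conflicts = {
        pair
        for pair in preds_a.keys() & preds_b.keys()
        if preds_a[pair] != preds_b[pair]
    }
    return {
        "shared": a & b,
        "only_a": a - b,
        "only_b": b - a,
        "predicate_conflicts": conflicts,
    }


# Made-up sample data; identifiers and predicate labels are placeholders.
DUMP_A = (
    "subject\tpredicate\tobject\n"
    "GENE:1\tcauses\tDISEASE:1\n"
    "GENE:2\tcauses\tDISEASE:2\n"
)
DUMP_B = (
    "subject\tpredicate\tobject\n"
    "GENE:1\tcauses\tDISEASE:1\n"
    "GENE:2\tcontributes_to\tDISEASE:2\n"
    "GENE:3\tcauses\tDISEASE:3\n"
)

result = compare_dumps(
    load_triples(io.StringIO(DUMP_A)),
    load_triples(io.StringIO(DUMP_B)),
)
for key, values in result.items():
    print(key, sorted(values))
```

Reporting the (subject, object) pairs whose predicates differ separately from plain set differences may help distinguish associations that one application misses entirely from associations that two applications both capture but model with different semantics.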
@madanucd this ticket may be of help to your G2D ingest assessment.
@sagehrke I'm not that sure what to make of this exercise now, after all the discussions so many months ago. We had a "70% solution" but I'm not sure what comes next.
Perhaps @madanucd and @kevinschaper can connect with you, @RichardBruskiewich, to see what next steps are regarding G2D review and any potential updates to ingest mappings.
Given that my Monarch subaward budget is depleted, I can no longer contribute to the resolution of this issue.
Related to monarch-initiative/monarch-app#707
phenol and hpoannotQC should be considered the source of truth. This pipeline outputs phenotype.hpoa, which does not have genetic data. Other parts of phenol combine the genetic data, and this is used for the HPO website and API. The latter has recently been reworked by Mike and could provide a more unified view on several ontologies and could be more easily adapted for Monarch (e.g., uberon, Mondo, Maxo browsers).
The Monarch graph has ingested HPOA (OMIM, Orphanet, MorbidMap, etc.) mappings, but these have some subtle issues of precision and completeness, and appear to be generated from secondary data sources that have challenging semantics. More importantly, the Monarch Initiative (and other related projects) have spawned numerous additional code bases that overlap heavily but are heterogeneous in design, for example:
Closely related to the G2D mapping task are the underlying disease and phenotype ontology efforts:
The goal of this issue is a compare-and-contrast (tabular?) review of the relevant code bases that parse G2D input data, to identify a common, normalized (singular) approach for the ingest of G2D mappings into the Monarch knowledge graph. The review would aim to characterize the following for each reviewed code library:
Reviews Archive
https://drive.google.com/drive/folders/1ob6BiPuVcVGyO7kkNfTHjfoxGXAPbc5m