RichardBruskiewich commented 1 year ago

The Monarch graph has ingested HPOA (OMIM, Orphanet, MorbidMap, etc.) mappings, but these have some subtle issues of precision and completeness, and appear to be generated from secondary data sources that have challenging semantics. More importantly, the Monarch Initiative (and other related projects) has spawned numerous additional code bases that overlap heavily yet differ from one another in design, for example:

Monarch OMIM parser
Dipper OMIM parser
Exomiser: OrphanetDiseaseGeneFactory and OmimGeneMap2Reader plus some
Monarch Ingest OMIM parser
HPO Annotation QC
Phenol: Ontology Library for Phenomics and Genomics
(additional overlapping code bases as may be identified along the way...)

Closely related to the gene-to-disease (G2D) mapping task are the underlying disease and phenotype ontology efforts:

The goal of this issue is a compare-and-contrast (tabular?) review of the relevant G2D input data parsing code bases, to identify a common, normalized (singular) approach for the ingest of Monarch knowledge graph G2D mappings. The review would aim to characterize the following for each reviewed code library:

[ ] Enumeration and general review of the composition of G2D-related input (knowledge) data files which are parsed by the library
[ ] Parsing heuristics ('rules') and algorithms internally encoded by the library
[ ] Enumeration and description of library output formats
[ ] Review of possible output formats (e.g. TSV?) for the Monarch KG construction pipeline, which could be added to the given library, to allow for optimal and complete capture of gene-to-disease knowledge (from OMIM, Orphanet, etc.) within the Monarch knowledge graphs
[ ] Review and highlight the relationship of the library to MONDO and HPO.

Reviews Archive

https://drive.google.com/drive/folders/1ob6BiPuVcVGyO7kkNfTHjfoxGXAPbc5m

RichardBruskiewich commented 1 year ago

@pnrobinson, @cmungall, @putmantime, @kevinschaper @matentzn ... I've 'assigned' you to this issue for the moment, simply to flag the issue for your kind feedback and augmentation.

I am otherwise initiating the review of the Phenol code (Peter, as I have questions about the code base, I'll coordinate with you and Daniel for guidance).

RichardBruskiewich commented 1 year ago

One ancient related issue (in the icebox): https://github.com/monarch-initiative/monarch-ingest/issues/251

matentzn commented 1 year ago

putmantime commented 1 year ago

@RichardBruskiewich @matentzn offered to give you an overview of the Exomiser/Koza/Mondo situation regarding g2d. We'd like to have a data call after this review process is complete to come up with and schedule the work for a generalized solution.

RichardBruskiewich commented 1 year ago

@matentzn and @putmantime, thank you for the meeting on the 16th March 2023, to discuss this task and formulate a plan for its resolution. Briefly:

Study and document all the ways that OMIM and Orphanet are being processed within various code bases hosted by Monarch, to guide the creation of a more comprehensive Koza ingest of a more normalized set of Gene-to-Disease (G2D) and Phenotype-to-Disease (P2D) subject-predicate-object associations for the Monarch Graph.
Goal: The Monarch team is attempting to document all the processes for G2D and P2D capture (specifically, of OMIM and Orphanet data) across Monarch, to identify how it is currently being done, to clarify the provenance of knowledge so that comparative analyses are easier, and to create a comprehensive G2D and P2D ingest for Monarch.
To meet this goal, an inventory of existing Monarch-hosted (or used) project 'solutions' that have some component of parsing OMIM and Orphanet information into G2D and P2D subject-predicate-object associations will be reviewed. A tentative list of such 'solutions' is already compiled in the task plan (although more may be added if necessary) with identified "application experts" listed alongside. This list currently includes the following Monarch-affiliated applications: Exomiser, Phenol, HPOQC, MONDO OMIM ingest, Dipper and Koza itself.
We will conduct a basic self-study of each 'solution' code base, with the aim of composing a basic architecture and data flow diagram, with brief supporting notes, to serve as a conversation piece with the "application experts" guiding the capture of suitable descriptions of each application with respect to the objective of capturing G2D and P2D associations.
A common interview script of questions is formulated to be posed to each such "application expert" to drive the compilation of software and data characteristics of each application, and includes a request for (sample) 'dumps' of files containing data relating to G2D and P2D associations. An approximately 1 hour interview based on the script will be scheduled and convened with each identified application expert, to correct/refine the aforementioned application architecture and data flow diagram and document additional information relevant to the task goal.
The resolution of this issue will be the documented answers to the aforementioned questions, the data dumps requested, and a first-order comparison of these applications and their data dumps against one another, to guide future Monarch G2D and P2D association Koza ingest design and implementation. These deliverables will be hosted in a secure Monarch private storage bucket for further Monarch team assessment.
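As a rough illustration of what the "first-order comparison" of data dumps could look like, here is a minimal sketch. It assumes each application's dump can be reduced to a TSV file with `subject`, `predicate` and `object` columns; those column names, the `load_associations` and `compare_dumps` helpers, and the placeholder identifiers are all hypothetical and are not taken from any of the reviewed code bases.

```python
import csv
import io


def load_associations(handle):
    """Read (subject, predicate, object) triples from a TSV handle with a header row.

    The column names are an assumption made for this sketch only.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {(row["subject"], row["predicate"], row["object"]) for row in reader}


def compare_dumps(a, b):
    """Split two sets of triples into shared associations and those unique to each."""
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Placeholder data, not real G2D associations.
dump_a = io.StringIO(
    "subject\tpredicate\tobject\n"
    "GENE:1\tcauses\tDISEASE:1\n"
    "GENE:2\tcauses\tDISEASE:2\n"
)
dump_b = io.StringIO(
    "subject\tpredicate\tobject\n"
    "GENE:1\tcauses\tDISEASE:1\n"
    "GENE:3\tcauses\tDISEASE:3\n"
)

report = compare_dumps(load_associations(dump_a), load_associations(dump_b))
for key in ("shared", "only_a", "only_b"):
    print(key, len(report[key]))
```

A plain set comparison like this only works once identifiers and predicates have been normalized across applications, which is presumably where most of the real effort would go.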

sagehrke commented 11 months ago

@madanucd this ticket may be of help to your G2D ingest assessment.

RichardBruskiewich commented 9 months ago

@sagehrke I'm not sure what to make of this exercise now, after all the discussions so many months ago. We had a "70% solution" but I'm not sure what comes next.

sagehrke commented 8 months ago

Perhaps @madanucd and @kevinschaper can connect with you, @RichardBruskiewich, to see what next steps are regarding G2D review and any potential updates to ingest mappings.

RichardBruskiewich commented 8 months ago

Given that my Monarch subaward budget is depleted, I can no longer contribute to the resolution of this issue.

sagehrke commented 8 months ago

Related to monarch-initiative/monarch-app#707

pnrobinson commented 4 months ago

phenol and hpoannotQC should be considered the source of truth. This pipeline outputs phenotype.hpoa, which does not have genetic data. Other parts of phenol combine the genetic data and this is used for the HPO website and API. The latter has recently been reworked by Mike and could provide a more unified view on several ontologies and could be more easily adapted for Monarch (e.g., uberon, Mondo, Maxo browsers).

monarch-initiative / monarch-app

Upgrade of Gene to Disease ingest mappings #709
