Aim 2.2.1 Create a unified data model for disease knowledge.

sagehrke commented 1 year ago

Summary

Create a unified data model for disease knowledge. There is a wealth of information that can be associated with any disease: what phenotypes are manifested, the causative variants and their effects on gene function, available treatments, exposures or environments that mitigate or exacerbate the disease. These associations can be qualified with modifiers such as severity, onset, etc. It is not enough to say a gene is associated with a disease; the nature of the association must be described using a controlled vocabulary or ontology. For example, does the disease arise because of a germline mutation in the gene or a somatic mutation, or do variants in one gene modify a disease caused by variants in another gene (“modifier gene”)? Probabilistic diagnostic and phenotype matching procedures make use of information such as the frequency of the phenotype in a disease (all else equal, if a phenotype is more commonly observed in disease A than disease B, then disease A is more probable than disease B in the differential diagnosis if that phenotype is observed [77]). Finally, capturing the evidence and provenance behind disease associations is critical to allow researchers and clinicians to evaluate the credibility and utility of this knowledge. We will create a data model that can be used to express these facets. We will first create a high-level conceptual model that can be easily visualized and communicated, basing this on the representation of disease in our BioLink model [78]. We will capture the provenance for assertions that rely on careful evidence evaluation, such as variant pathogenicity or gene validity interpretations, by utilizing standards such as the Scientific Evidence and Provenance Information Ontology (SEPIO) modeling framework being employed across ClinGen and the Global Alliance for Genomics and Health (GA4GH) communities [79]. 
We will then use this conceptual model to derive several concrete formats, including a simple tab-separated format, a JSON representation, and Protocol Buffers [80]. We will also derive programming-language-specific bindings for use in different applications.
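
As a rough illustration of the facets described above (a typed association, optional qualifiers, and evidence/provenance) and of deriving tab-separated and JSON output from one definition, here is a minimal Python sketch. It is not the Biolink or SEPIO schema: the class name `DiseaseAssociation`, its field names, and all `EX:` identifiers are hypothetical placeholders chosen for this example.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields
from typing import List, Optional


@dataclass
class DiseaseAssociation:
    """One disease association; all field names here are illustrative assumptions."""
    disease: str                       # identifier of the disease
    predicate: str                     # nature of the association, from a controlled vocabulary
    object: str                        # the associated gene, phenotype, treatment, exposure, ...
    frequency: Optional[float] = None  # qualifier: fraction of cases showing a phenotype
    onset: Optional[str] = None        # qualifier: onset term
    severity: Optional[str] = None     # qualifier: severity term
    evidence: Optional[str] = None     # type of evidence supporting the assertion
    source: Optional[str] = None       # provenance: who or what made the assertion


def to_json(assoc: DiseaseAssociation) -> str:
    """Serialize one association as a JSON object."""
    return json.dumps(asdict(assoc), sort_keys=True)


def to_tsv(assocs: List[DiseaseAssociation]) -> str:
    """Serialize associations as tab-separated text with a header row."""
    buf = io.StringIO()
    names = [f.name for f in fields(DiseaseAssociation)]
    writer = csv.DictWriter(buf, fieldnames=names, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for assoc in assocs:
        writer.writerow(asdict(assoc))  # unset (None) qualifiers become empty cells
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder identifiers only; none of these refer to real ontology terms.
    example = DiseaseAssociation(
        disease="EX:disease1",
        predicate="EX:caused_by_germline_variant_in",
        object="EX:geneA",
        evidence="EX:curated_literature",
        source="EX:curator1",
    )
    print(to_json(example))
    print(to_tsv([example]))
```

Keeping a single definition and generating each format from it is the design choice the paragraph describes; a schema language would play the role of the dataclass in a real implementation.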

People

@cmungall
Others

Key results for the end of the grant

[ ] A documentation page that explains how the "unified data model for disease knowledge" can be used
[ ] Some kind of agreement across Monarch to promote it?

Comments

The content of this box was provided by @matentzn

nlharris commented 1 year ago

I believe this is underway and will be significantly advanced by the LinkML / Phenomics First Common Data Model hackathon planned for March.

matentzn commented 1 year ago

I am not that sure about it. Can we get a sense of what, concretely, has been happening on this? I think we are cutting it pretty short on this one, but maybe I am just not clued in on what is happening in other projects.

nlharris commented 1 year ago

I will discuss with @cmungall

cmungall commented 1 year ago

I think this is covered by:

Biolink Model, for generalized disease knowledge
Mondo design patterns

These are all progressing well.

cmungall commented 11 months ago

I think we can call the MVP done here

matentzn commented 11 months ago

@cmungall I think this may be a bit too piecemeal. At the very least we should have a documentation page called "Unified Model of Disease Knowledge" that describes how these various pieces fit together. Even I am a bit uncertain (not entirely uncertain) how they do, and we do not yet have a comprehensive model. IMO, I don't even know what a "data model of disease knowledge" really is; "disease knowledge" is a terribly generic term.

In the description of the research proposal, we say (as example facets of the disease model):

germline mutation vs “modifier gene”
frequency of the phenotype in a disease
evidence and provenance behind disease associations is critical to allow researchers and clinicians to evaluate the credibility and utility of this knowledge
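
To make the "frequency of the phenotype in a disease" facet concrete, here is a minimal Python sketch of the reasoning in the proposal text: with equal priors, the disease in which the observed phenotype is more frequent ends up more probable. The function name `posterior`, the disease and phenotype labels, and the frequency values are all made up for illustration; this is a plain Bayes-rule calculation, not the procedure used by any particular diagnostic tool.

```python
from typing import Dict


def posterior(
    priors: Dict[str, float],
    phenotype_freq: Dict[str, Dict[str, float]],
    observed: str,
) -> Dict[str, float]:
    """Bayes rule over candidate diseases, given one observed phenotype."""
    # Unnormalized score: P(disease) * P(phenotype | disease).
    scores = {
        disease: prior * phenotype_freq[disease].get(observed, 0.0)
        for disease, prior in priors.items()
    }
    total = sum(scores.values())
    if total == 0:
        raise ValueError("observed phenotype has zero frequency in every disease")
    return {disease: score / total for disease, score in scores.items()}


if __name__ == "__main__":
    # Hypothetical numbers: equal priors ("all else equal"), different frequencies.
    priors = {"disease_A": 0.5, "disease_B": 0.5}
    phenotype_freq = {
        "disease_A": {"phenotype_X": 0.8},
        "disease_B": {"phenotype_X": 0.2},
    }
    ranked = posterior(priors, phenotype_freq, "phenotype_X")
    for disease, prob in sorted(ranked.items(), key=lambda kv: -kv[1]):
        print(disease, round(prob, 3))
```

This only works if the model stores a frequency on each disease-phenotype association, which is why the frequency qualifier matters for the data model.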

Also, non functional aspects:

"can be easily visualized and communicated" -> At least a page with a diagram

In my view of the task, we should either aim for a biolink subset that we can visualise and communicate independently, or a separate linkml model.

What is completely unclear right now: What is the point of such a model? It seems to me this is just a semi-arbitrary subset of the complete Monarch KG data model? Is this model about "modelling disease aspects in KGs"?

monarch-initiative / phenomics_first_resource