Open robert-haas opened 3 months ago
I've attached some subgraphs, extracted from the five BMKGs I mentioned, as concrete examples: cml.zip
They represent the direct neighborhood of the disease CML (chronic myeloid leukemia), which is present in each BMKG and is related to a drug named Imatinib, one of the first targeted cancer therapies.
I've exported the subgraphs in two formats:
Biomedical knowledge graphs (BMKGs)
Examples
As part of a DeepFunding round 3 project, I've created a repository about BMKGs that contains 1) a broad survey of the BMKGs currently available from academic and commercial projects, 2) a narrowed-down list of those I found interesting, and 3) notebooks that closely inspect five projects related to pharmacology and drug discovery, e.g. ones relevant to drug repurposing or side-effect prediction.
The final goal of the project is to bring the KGs of these five projects to Hyperon, not just by a one-time conversion to MeTTa, but with a Python package that can fetch and convert the latest version of each KG whenever desired, and perhaps in multiple formats, maybe even user-customizable. For example, Monarch is a BMKG project that is updated roughly once per month to reflect the latest state of the source databases, so it's desirable to also be able to continually fetch and convert its KG.
Observations
Definitions: The survey contains a section in which I've collected the definitions of the phrase "knowledge graph" given by the various projects, which mostly boil down to the graph data model they use. There I pointed out that "1) not all projects explicitly describe what type of knowledge graph they are constructing but rather require the reader to deduce it from the context and results, and 2) only a few projects provide references to detailed discussions of the graph data model they have chosen for their purposes, leaving it to the reader to identify and acquire presupposed background knowledge". Of all the discussions, only two provided references to useful literature that defines the term "knowledge graph":
Data sources: The survey contains a section about databases and one about ontologies. Of course there are thousands of biomedical databases and ontologies, so I've not listed them in the survey, but provided references to existing collections. I suppose this may be useful whenever one wants to construct a special-purpose BMKG and needs to systematically find all databases and ontologies that could be relevant to it.
Tools: Another section of the survey lists tools for creating knowledge graphs that were mentioned in the literature I've gone through. BioCypher seems quite useful and was adopted by Rejuve. KGX also looks good, together with the Biolink model standard it builds on.
Towards a shared graph data model and file format
From looking at a lot of BMKGs, it appears to me that most or all of them can be mapped onto the property graph model used by many graph databases, which is also explained in the tutorial paper I've linked above. There is also an ISO standard graph query language for it, GQL. I suppose both the model and the query language could inform how to represent and query KGs in MeTTa.
In the Python package, I will use an intermediary format, so that I can first convert all KGs to it and then need only a single converter to MeTTa (or any other format). In the notebooks, I've experimented with a few options for such a format. For multiple reasons, I've decided to use a simple CSV file representation with a JSON string to encode key/value properties. This makes it easy, for example, to load the data into SQL databases and use their built-in JSON functions to query the properties as well. It also makes partial loading relatively easy, e.g. only a subset of nodes and edges, or only triples without properties. I describe it a bit in section 6 of HALD, but it's essentially:
Nodes: id (str), type (str), properties (dict)
Edges: source_id (str), target_id (str), type (str), properties (dict)
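To make the format a bit more concrete, here is a minimal sketch in Python using only the standard library. It is not the code from the notebooks or the planned package: the identifiers, types, and property values are made-up examples, the helper `write_csv` is hypothetical, and it assumes the local SQLite build ships the JSON functions (true for most recent Python distributions).

```python
import csv
import io
import json
import sqlite3

# Hypothetical example rows; identifiers, types, and property values are made up.
nodes = [
    ("disease:1", "Disease", {"name": "CML"}),
    ("drug:1", "Drug", {"name": "Imatinib"}),
]
edges = [
    ("drug:1", "disease:1", "treats", {"PMID": [123123, 412313, 191823]}),
]


def write_csv(rows, header):
    """Serialize rows to CSV text; the trailing properties dict becomes a JSON string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    for *fields, properties in rows:
        writer.writerow([*fields, json.dumps(properties)])
    return buffer.getvalue()


nodes_csv = write_csv(nodes, ["id", "type", "properties"])
edges_csv = write_csv(edges, ["source_id", "target_id", "type", "properties"])

# Partial loading: read only the triples and ignore the properties column.
triples = [
    (row["source_id"], row["type"], row["target_id"])
    for row in csv.DictReader(io.StringIO(edges_csv))
]

# Full loading of one edge, decoding the JSON string back into a dict.
first_edge = next(csv.DictReader(io.StringIO(edges_csv)))
first_edge_properties = json.loads(first_edge["properties"])

# Load the node table into an in-memory SQLite database and query a property
# with the built-in JSON function json_extract.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE nodes (id TEXT PRIMARY KEY, type TEXT, properties TEXT)"
)
connection.executemany(
    "INSERT INTO nodes VALUES (:id, :type, :properties)",
    csv.DictReader(io.StringIO(nodes_csv)),
)
drug_names = [
    name
    for (name,) in connection.execute(
        "SELECT json_extract(properties, '$.name') FROM nodes WHERE type = 'Drug'"
    )
]
```

Keeping the properties in a single JSON column means the CSV header never changes, regardless of which keys a particular KG uses, at the cost of needing a JSON-aware reader whenever the properties matter.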
Open questions
1. How should a property graph be represented in MeTTa?
   a. Are there requirements for identifiers, such as not containing whitespace or certain other characters?
   b. Should node and edge types be encoded directly with MeTTa's type system? Or are there other ways that seem appropriate?
   c. Should nodes and edges be defined together or separately? There seems to be a trade-off between redundancy and clarity. For example, KGs that come as triples usually define the nodes only within the triples and not explicitly on their own, since that would be redundant. If there are node and edge properties, however, this quickly becomes messy, so node and edge definitions are often separated.
   d. How should properties (i.e. lists of key/value pairs) be encoded? Should they be unrolled and represented as additional nodes, e.g. {"PMID": [123123, 412313, 191823]} as three nodes of type PMID, with the values from the list forming the three node identifiers? Or should they be attached in some way to the entity that carries the property, without introducing new entities?
   e. Special case: Should relations like "is_subclass_of", e.g. between different disease nodes, be encoded as regular edges as in the original data? Or would it be better to express them by forming a type hierarchy, though its depth might change with additional data?
2. Does the chosen data representation depend on what kind of queries are frequently performed? If so, can there be a "one size fits all" format suggestion, or should it rather be tailored to a specific graph data model or perhaps even to a specific KG?
   a. Are there general query patterns that every KG will need to support?
   b. Are there specific query patterns that only BMKGs will need to support?
   The papers on the five projects I've mentioned contain various examples of real-world tasks, which might be informative w.r.t. this question, and one could attempt to reproduce them and measure their efficiency. I'm not sure how representative they are though.
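To illustrate the two alternatives for encoding properties mentioned in the first question, here is a small Python sketch that produces both shapes from the same property dict. It deliberately emits plain tuples instead of MeTTa expressions, since the target syntax is exactly what is being asked about. The function names, the `has_` prefix for the connecting edges, and the identifier `edge:1` are assumptions made up for this example.

```python
def unroll_properties(entity_id, properties):
    """Option 1: every property value becomes its own node plus a connecting edge."""
    nodes, edges = [], []
    for key, value in properties.items():
        # Treat a scalar value like a one-element list.
        values = value if isinstance(value, list) else [value]
        for item in values:
            node_id = str(item)
            nodes.append((node_id, key))  # (id, type)
            # The "has_" edge type is a hypothetical naming choice.
            edges.append((entity_id, f"has_{key}", node_id))
    return nodes, edges


def attach_properties(entity_id, properties):
    """Option 2: keep properties on the entity as (entity, key, value) facts."""
    return [(entity_id, key, value) for key, value in properties.items()]


properties = {"PMID": [123123, 412313, 191823]}
unrolled_nodes, unrolled_edges = unroll_properties("edge:1", properties)
attached = attach_properties("edge:1", properties)
```

With the example above, the unrolled variant yields three new nodes of type PMID and three edges pointing to them, while the attached variant yields a single fact that still contains the list. Note that unrolling properties of an edge requires the edge itself to have an identifier, which triple-only formats usually do not provide.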
(Apologies for the wall of text. I got carried away a bit. I hope it's a useful issue anyways.)