trueagi-io / metta-examples

Discussion of MeTTa programming with examples

Biomedical knowledge graphs: Examples, resources, a shared representation, and how to encode it in MeTTa #50

Open robert-haas opened 3 weeks ago

robert-haas commented 3 weeks ago

Biomedical knowledge graphs (BMKGs)

Examples

As part of a DeepFunding round 3 project, I've created a repository about BMKGs that contains 1) a broad survey of the BMKGs currently available from academic and commercial projects, 2) a narrowed-down list of those I found interesting, and 3) notebooks that closely inspect five projects related to pharmacology and drug discovery, e.g. relevant to tasks such as drug repurposing or side-effect prediction.

The final goal of the project is to bring the KGs of these five projects to Hyperon, not just by a one-time conversion to MeTTa, but with a Python package that can fetch and convert the latest version of each KG whenever desired, and perhaps in multiple formats, maybe even user-customizable. For example, Monarch is a BMKG project that is updated roughly once per month to reflect the latest state of the source databases, so it's desirable to also be able to continually fetch and convert its KG.

Observations

Towards a shared graph data model and file format

From looking at a lot of BMKGs, it appears to me that most or all of them can be mapped onto the property graph model used by many graph databases, which is also explained in the tutorial paper I've linked above. There is also an ISO-standardized graph query language for this model, GQL. I suppose both the model and the query language could inform how to represent and query KGs in MeTTa.

In the Python package, I will use an intermediary format, so I can first convert all KGs to it and then have a single converter to MeTTa (or any other format). In the notebooks, I've experimented with a few options for such a format. For multiple reasons I've decided to use a simple CSV file representation in which a JSON string encodes the key/value properties. This makes it easy, for example, to load the data into SQL databases and use their built-in JSON functions to query the properties as well. It also makes partial loading relatively easy, e.g. loading only a subset of nodes and edges, or only triples without properties. I describe it a bit in section 6 of HALD, but it's essentially:

  1. Each node is one line in nodes.csv: id (str), type (str), properties (dict)
  2. Each edge is one line in edges.csv: source_id (str), target_id (str), type (str), properties (dict)
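
To make the two-file layout concrete, here is a minimal Python sketch using only the standard library. The identifiers, type names and property keys are made up for illustration and are not taken from any of the five BMKGs. The code is also not the actual converter, just one way the format described above could be written, partially loaded, and queried through SQLite's JSON functions (assuming the SQLite build that ships with Python provides `json_extract`).

```python
import csv
import io
import json
import sqlite3

# Hypothetical example records; identifiers and property names are made up.
NODES = [
    ("disease:1", "Disease", {"name": "chronic myeloid leukemia"}),
    ("drug:1", "Drug", {"name": "Imatinib"}),
]
EDGES = [
    ("drug:1", "disease:1", "treats", {"PMID": [123123, 412313]}),
]


def write_rows(rows, header):
    """Serialize rows to CSV text, encoding the trailing dict as a JSON string."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    for *fields, props in rows:
        writer.writerow([*fields, json.dumps(props)])
    return buf.getvalue()


def read_rows(text, with_properties=True):
    """Parse CSV text back into tuples; optionally drop the properties column."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line
    for *fields, props in reader:
        if with_properties:
            yield (*fields, json.loads(props))
        else:
            yield tuple(fields)


nodes_csv = write_rows(NODES, ["id", "type", "properties"])
edges_csv = write_rows(EDGES, ["source_id", "target_id", "type", "properties"])

# Partial loading: triples only, without decoding any properties.
triples = list(read_rows(edges_csv, with_properties=False))

# Load the nodes into SQLite and query a property with a JSON function.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY, type TEXT, properties TEXT)")
con.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [(i, t, json.dumps(p)) for i, t, p in read_rows(nodes_csv)],
)
drug_names = [
    row[0]
    for row in con.execute(
        "SELECT json_extract(properties, '$.name') FROM nodes WHERE type = 'Drug'"
    )
]
```

Because the properties stay in a single JSON column, the set of keys can differ from node to node without changing the CSV header or the table schema.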

Open questions

  1. How should a property graph be represented in MeTTa?
     a. Are there requirements for identifiers, such as not containing whitespace or certain other characters?
     b. Should node and edge types be encoded directly with MeTTa's type system? Or are there other ways that seem appropriate?
     c. Should nodes and edges be defined together or separately? There seems to be a trade-off between redundancy and clarity. For example, KGs that come as triples usually define the nodes only within the triples and not explicitly on their own, since that would be redundant. If there are node and edge properties, however, this quickly becomes messy, so node and edge definitions are often separated.
     d. How should properties (i.e. lists of key/value pairs) be encoded? Should they be unrolled and represented as additional nodes, e.g. {"PMID": [123123, 412313, 191823]} as three nodes of type PMID, with the values from the list forming the three node identifiers? Or should they be attached in some way to the entity that carries the property, without introducing new entities?
     e. Special case: Should relations like "is_subclass_of", e.g. between different disease nodes, be encoded as regular edges as in the original data? Or would it be better to express them as a type hierarchy, even though its depth might change with additional data?

  2. Does the choice of data representation depend on what kinds of queries are frequently performed? If so, can there be a "one size fits all" format suggestion, or should it rather be tailored to a specific graph data model or perhaps even to a specific KG?
     a. Are there general query patterns that every KG will need to support?
     b. Are there specific query patterns that only BMKGs will need to support?
     The papers accompanying the five projects I've mentioned contain various examples of real-world tasks, which might be informative w.r.t. this question, and one could try to reproduce them and measure their efficiency. I'm not sure how representative they are, though.
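
To make questions 1b and 1d more tangible, here is a small Python sketch that prints two candidate encodings as S-expression strings. Everything in it is an assumption for the sake of discussion: the `property` and `has_<key>` symbols, the `<key>_<value>` naming of unrolled nodes, and the identifiers are all hypothetical, and nothing here is an established MeTTa convention or uses a Hyperon API. It only formats text, and it assumes that identifiers and property values are already safe to use as symbols (question 1a).

```python
import json


def atom(*parts):
    """Format parts as a parenthesized, space-separated expression."""
    return "(" + " ".join(str(p) for p in parts) + ")"


def literal(value):
    """Render a Python value: strings quoted, lists as nested expressions."""
    if isinstance(value, str):
        return json.dumps(value)
    if isinstance(value, list):
        return atom(*(literal(v) for v in value))
    return str(value)


def node_typed(node_id, node_type):
    """Candidate for 1b: state the node type as a (: id Type) expression."""
    return atom(":", node_id, node_type)


def edge_atom(source_id, target_id, edge_type):
    """Edge as a triple-like expression (hypothetical layout)."""
    return atom(edge_type, source_id, target_id)


def props_attached(entity, props):
    """Candidate for 1d: keep properties on the entity, one expression per key."""
    return [atom("property", entity, key, literal(val)) for key, val in props.items()]


def props_unrolled(entity, props):
    """Candidate for 1d: turn every property value into its own typed node."""
    out = []
    for key, val in props.items():
        for v in val if isinstance(val, list) else [val]:
            value_id = f"{key}_{v}"  # hypothetical naming scheme for value nodes
            out.append(atom(":", value_id, key))
            out.append(atom(f"has_{key}", entity, value_id))
    return out


edge = edge_atom("drug_1", "disease_1", "treats")
for line in [node_typed("drug_1", "Drug"), edge]:
    print(line)
for line in props_attached(edge, {"PMID": [123123, 412313]}):
    print(line)
for line in props_unrolled(edge, {"PMID": [123123, 412313]}):
    print(line)
```

The attached variant yields one expression per property key, while the unrolled variant yields two expressions per value, so the size difference between the two grows with the number of list-valued properties.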

(Apologies for the wall of text. I got carried away a bit. I hope it's a useful issue anyways.)

robert-haas commented 3 weeks ago

I attach some subgraphs extracted from the five BMKGs that I mentioned as concrete examples: cml.zip

They represent the direct neighborhood of the disease CML, which is present in each BMKG and related to a drug named Imatinib, one of the first targeted cancer therapies.
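
For readers who want to reproduce something similar on their own data, the following Python sketch shows one plausible reading of "direct neighborhood": all edges that touch the center node, plus the nodes at their other ends. This is an assumption about the extraction, not a description of the code that produced cml.zip, and the identifiers below are made up.

```python
def direct_neighborhood(edges, center):
    """Return (node_ids, kept_edges) for all edges that touch `center`.

    Each edge is a tuple starting with (source_id, target_id, ...).
    Edges between two neighbors that do not involve `center` are dropped.
    """
    kept = [e for e in edges if center in (e[0], e[1])]
    node_ids = {center}
    for source_id, target_id, *_ in kept:
        node_ids.update((source_id, target_id))
    return node_ids, kept


# Hypothetical edges in the (source_id, target_id, type, properties) layout.
example_edges = [
    ("drug_1", "disease_1", "treats", {}),
    ("gene_1", "disease_1", "associated_with", {}),
    ("drug_1", "gene_1", "targets", {}),
]
neighbor_ids, neighbor_edges = direct_neighborhood(example_edges, "disease_1")
```

Whether edges between two neighbors (such as the `targets` edge in the example) should also be included is a design choice. This sketch leaves them out.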

I've exported the subgraphs in two formats:

  1. The CSV-based format I've described above, where nodes and edges are separated.
  2. GraphML as generated by the Python package igraph. This was the first candidate for an intermediary format that can cover all BMKGs (multi-relational, allowing multi-edges and properties, etc.), but it has several disadvantages compared with the simple CSV approach.