the-human-colossus-foundation / oca-spec

Overlay Capture Architecture Specification
European Union Public License 1.2

How to model labeled property graphs #41

Open blelump opened 8 months ago

blelump commented 8 months ago

Problem overview

A reference type in OCA reflects a traditional RDBMS, where relations among objects are established using unnamed edges. The cardinality of a relation (has-one, has-many, or many-to-many) depends on both the structure of the objects and the structure of the relationship itself, especially in the many-to-many case. This is what OCA supports.

In labeled property graphs (LPGs), the edges between objects (nodes) can additionally carry a label and properties of their own. Furthermore, graph databases are designed to place far more emphasis on the relationships among nodes.
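To make the contrast concrete, here is a minimal sketch of the two shapes. The `Node` and `Edge` classes, the `project_ref` field, and all sample values are hypothetical illustrations, not part of the OCA specification or of any graph database API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    node_id: str
    labels: set[str]
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: str          # node_id of the start node
    target: str          # node_id of the end node
    label: str           # the edge itself is named
    properties: dict[str, Any] = field(default_factory=dict)


# Reference style: the relation is an unnamed pointer stored on the object.
sample_row = {"id": "sample-1", "name": "S-001", "project_ref": "project-7"}

# LPG style: the relation is a first-class element with a label and properties.
sample = Node("sample-1", {"Sample"}, {"name": "S-001"})
project = Node("project-7", {"Project"}, {"title": "Soil survey"})
collected_for = Edge(
    source=sample.node_id,
    target=project.node_id,
    label="COLLECTED_FOR",
    properties={"role": "primary"},
)
```

The open question is where the `label` and `properties` of `collected_for` would live in OCA, since a plain reference has no place to put them.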

What is the DSWG's view on modeling LPGs with OCA? Should we try to mirror the node and edge hierarchy in the graph as a whole, or model nodes and edges separately and enforce consistency using upper-level layers?

carlyh-micb commented 8 months ago

What would be a real-world example of an LPG dataset to be modelled in OCA? That would help ground the discussion.

pknowl commented 8 months ago

The simplest integrated workflow to model a Labeled Property Graph involves two structural pieces and two conceptual pieces to showcase how data can be structured, semantically enriched, and then represented within a graph database environment. Here's how they work together, step by step:

Morphological Semantic Objects: [Note: "Morpho-" relates to form]

  1. OCA Capture Base: The capture base is a JSON object that represents the foundational data structure for capturing information about, say, scientific samples. It specifies the data that can be collected, such as the sample's name, type, and concentration. This is the starting point, where raw data is structured but not yet semantically enriched.

  2. Attribute Framing Overlay: The Attribute Framing Overlay (see RFC 0004) provides a mechanism to semantically enrich the data defined in the OCA Capture Base. It maps the attributes (e.g., SampleType, Concentration) to concepts in an ontology. Each attribute is linked to a specific concept (identified by frame_id) using a predicate that specifies the nature of the relationship (here skos:exactMatch, indicating that the attribute corresponds directly to the ontology concept). This overlay bridges the gap between the raw data and its semantic context, improving interoperability and the potential for more sophisticated data integration and analysis.
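The two structural pieces could look roughly like the sketch below. The capture base loosely follows the usual OCA layout, but the overlay layout is an assumption made for illustration and is not copied from RFC 0004; only `frame_id` and `skos:exactMatch` come from the description above. `SampleName`, the `ex:` prefix, and `check_framing` are hypothetical names.

```python
# Capture base: attribute names mapped to attribute types (structure only).
capture_base = {
    "type": "spec/capture_base/1.0",
    "classification": "",
    "attributes": {
        "SampleName": "Text",        # hypothetical attribute name
        "SampleType": "Text",
        "Concentration": "Numeric",
    },
    "flagged_attributes": [],
}

# Framing overlay: ASSUMED layout, not the normative RFC 0004 schema.
framing_overlay = {
    "type": "hypothetical/attribute_framing",  # placeholder type string
    "attribute_framing": {
        "SampleType": {"frame_id": "ex:SampleType", "predicate": "skos:exactMatch"},
        "Concentration": {"frame_id": "ex:Concentration", "predicate": "skos:exactMatch"},
    },
}


def check_framing(base: dict, overlay: dict) -> dict:
    """Report framed attributes missing from the base, and base attributes left unframed."""
    attributes = set(base["attributes"])
    framed = set(overlay["attribute_framing"])
    return {
        "unknown": sorted(framed - attributes),   # framed but not in the capture base
        "unframed": sorted(attributes - framed),  # captured but not semantically linked
    }


report = check_framing(capture_base, framing_overlay)
```

In this sketch `report["unframed"]` lists `SampleName`, which is captured but has no ontology concept attached yet.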

Epistemological Semantic Concepts: [Note: "Epistemo-" relates to knowledge]

  1. LPG Ontology: The ontology outlines the conceptual framework that the Attribute Framing Overlay refers to. It defines classes such as SampleType and Concentration, providing a structured vocabulary for the data. This ontology serves as the semantic backbone, defining the types of entities and relationships that exist in the domain of interest. By doing so, it enables consistent interpretation of data across different systems and platforms.

  2. LPG Instance (e.g. Cypher Query for Neo4j): Finally, a Cypher query shows how an instance of the semantically enriched data is represented in a Labeled Property Graph (LPG) database such as Neo4j. Here, a Sample node is created with properties (name, type, concentration) derived directly from the OCA Capture Base and implicitly linked to the ontology concepts via the Attribute Framing Overlay. This graph representation makes it possible to exploit the rich connectivity and relationships between data points for querying and analysis.
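As a rough sketch of the instance step, the helper below turns one captured record into a parameterized Cypher `CREATE` statement for a `Sample` node. The mapping from capture base attribute names to node property names, the `to_cypher` helper, and the record values are assumptions made for illustration.

```python
# Hypothetical mapping from capture base attributes to node property names.
PROPERTY_NAMES = {
    "SampleName": "name",
    "SampleType": "type",
    "Concentration": "concentration",
}


def to_cypher(record: dict) -> tuple[str, dict]:
    """Build a parameterized CREATE statement and its parameters for one record."""
    params = {PROPERTY_NAMES[attr]: value for attr, value in record.items()}
    # Values are passed as $parameters rather than inlined into the query text.
    props = ", ".join(f"{key}: ${key}" for key in params)
    return f"CREATE (s:Sample {{{props}}}) RETURN s", params


query, params = to_cypher(
    {"SampleName": "S-001", "SampleType": "plasma", "Concentration": 2.5}
)
print(query)
```

The returned query and parameter dictionary can then be handed to whichever Neo4j client is in use. Note that nothing in the resulting node records the ontology link, which is why the description above calls it implicit.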

How They Interact ...

From Structure to Semantics: Starting with the structured data in the OCA Capture Base, the Attribute Framing Overlay adds a layer of semantic context by linking data attributes to concepts in the ontology. This process transforms the raw data into semantically enriched data.

Ontology as the Semantic Framework: The ontology defines the universe of discourse for the data, establishing a shared vocabulary that ensures consistency and interoperability. It acts as the reference model for the semantic enrichment process.

Graph Representation for Analysis and Integration: The enriched data is then instantiated in an LPG database, where the semantic links (established by the ontology and overlay) inform the graph structure. This makes it possible to apply graph-based analysis and integration techniques that benefit from the semantic depth provided by the ontology.
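One possible reading of "semantic links inform the graph structure" is to materialize each framing entry as a labeled edge from the Sample node to a concept node, with the attribute name as an edge property. This is only a sketch of that interpretation, not a DSWG recommendation; the `Concept` label, the `ex:` identifiers, the framing layout, and the `framing_statements` helper are all hypothetical.

```python
# Hypothetical ontology classes and framing entries (layout assumed, see above).
ONTOLOGY_CLASSES = {"ex:SampleType", "ex:Concentration"}
FRAMING = {
    "SampleType": {"frame_id": "ex:SampleType", "predicate": "skos:exactMatch"},
    "Concentration": {"frame_id": "ex:Concentration", "predicate": "skos:exactMatch"},
}


def framing_statements(sample_name: str, framing: dict, ontology: set) -> list:
    """Return (query, params) pairs linking a Sample node to its framed concepts."""
    statements = []
    for attribute, frame in framing.items():
        if frame["frame_id"] not in ontology:
            raise ValueError(f"{frame['frame_id']} is not defined in the ontology")
        # Relationship types cannot be passed as Cypher parameters, so the type
        # is derived from the predicate and checked before being interpolated.
        rel_type = frame["predicate"].replace(":", "_").upper()
        if not rel_type.isidentifier():
            raise ValueError(f"unsafe relationship type: {rel_type}")
        query = (
            "MATCH (s:Sample {name: $name}) "
            "MERGE (c:Concept {id: $frame_id}) "
            f"MERGE (s)-[:{rel_type} {{attribute: $attribute}}]->(c)"
        )
        params = {"name": sample_name, "frame_id": frame["frame_id"], "attribute": attribute}
        statements.append((query, params))
    return statements


statements = framing_statements("S-001", FRAMING, ONTOLOGY_CLASSES)
```

Each statement produces an edge that carries both a label and a property, which is exactly the kind of element a plain OCA reference cannot express.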

Together, these components form a comprehensive approach to data management that spans initial data capture, semantic enrichment, and final data representation in a graph database.

Although it is out of scope for this thread, the final step will be to pull this semantic enrichment into the training datasets underpinning LLMs and LAMs. The final frontier!