neuml / txtai

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
https://neuml.github.io/txtai
Apache License 2.0

Feature request : Advanced Ontology Management #742

Open nicolas-geysse opened 5 days ago

nicolas-geysse commented 5 days ago

Here's a detailed plan to develop advanced ontology management capabilities for TxtAI, leveraging Owlready2 and other relevant libraries:

  1. Enhance Owlready2 integration for more sophisticated ontology manipulation:

a) Extend TxtAI's Graph class to incorporate Owlready2 functionality:

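import types

import networkx as nx
from owlready2 import *

# TxtAIGraph is a placeholder name for the txtai graph class this proposal would extend
class EnhancedOntologyGraph(TxtAIGraph):
    def __init__(self, ontology_iri):
        super().__init__()
        # load() fetches the ontology from the IRI, or from a local copy found on owlready2's onto_path
        self.onto = get_ontology(ontology_iri).load()
        self.graph = self._build_networkx_graph()

    def _build_networkx_graph(self):
        """Mirror the class hierarchy as a directed graph with parent -> child edges."""
        G = nx.DiGraph()
        for cls in self.onto.classes():
            G.add_node(cls.name, type='class')
            for parent in cls.is_a:
                # is_a can also hold restrictions and other constructs; keep named classes only
                if isinstance(parent, ThingClass):
                    G.add_edge(parent.name, cls.name)
        return G

    def add_class(self, class_name, parent_classes=None):
        parent_classes = parent_classes or []
        # Resolve parent names up front so a missing parent fails loudly
        parents = [self.onto[name] for name in parent_classes]
        if any(parent is None for parent in parents):
            raise ValueError(f"Unknown parent class in {parent_classes}")
        with self.onto:
            # Subclass the given parents directly, or owl:Thing when none are given
            types.new_class(class_name, tuple(parents) or (Thing,))
        self.graph.add_node(class_name, type='class')
        for parent in parent_classes:
            self.graph.add_edge(parent, class_name)

    def add_property(self, prop_name, domain, range):
        # Both ends must already exist as classes in the ontology
        domain_cls, range_cls = self.onto[domain], self.onto[range]
        if domain_cls is None or range_cls is None:
            raise ValueError(f"Unknown domain or range class: {domain}, {range}")
        with self.onto:
            new_prop = types.new_class(prop_name, (ObjectProperty,))
            new_prop.domain = [domain_cls]
            new_prop.range = [range_cls]
        self.graph.add_edge(domain, range, type='property', name=prop_name)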
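
To make the parent -> child edge direction used by `_build_networkx_graph` concrete, here is a minimal, dependency-free sketch. It uses a plain dict in place of the NetworkX `DiGraph` and a hard-coded toy hierarchy; `HIERARCHY`, `build_edges` and `descendants` are hypothetical names for illustration, not part of txtai or Owlready2.

```python
from collections import defaultdict

# Hypothetical class hierarchy: class name -> list of parent class names
HIERARCHY = {
    "Person": [],
    "Employee": ["Person"],
    "Manager": ["Employee"],
    "Company": [],
}

def build_edges(hierarchy):
    """Return parent -> children adjacency, the same direction the DiGraph uses."""
    children = defaultdict(list)
    for cls, parents in hierarchy.items():
        for parent in parents:
            children[parent].append(cls)
    return children

def descendants(children, root):
    """Collect every class reachable from root by following parent -> child edges."""
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

edges = build_edges(HIERARCHY)
print(sorted(descendants(edges, "Person")))  # ['Employee', 'Manager']
```

With the real `DiGraph`, the same query would be `nx.descendants(G, "Person")`, which is one reason to keep edges pointing from parent to child.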
  2. Implement versioning and change tracking for ontologies:

a) Create a VersionedOntology class that extends EnhancedOntologyGraph:

import datetime
import difflib
import io

class VersionedOntology(EnhancedOntologyGraph):
    def __init__(self, ontology_iri):
        super().__init__(ontology_iri)
        self.version_history = []
        self.current_version = 0

    def save_version(self, comment=""):
        """Snapshot the ontology as N-Triples text and append it to the history."""
        self.current_version += 1
        timestamp = datetime.datetime.now().isoformat()
        # Owlready2 ontologies have no serialize(); save() writes bytes to a file object
        buffer = io.BytesIO()
        self.onto.save(file=buffer, format="ntriples")
        # Sort the triples so diffs reflect content changes rather than output order
        lines = sorted(buffer.getvalue().decode("utf-8").splitlines())
        self.version_history.append({
            "version": self.current_version,
            "timestamp": timestamp,
            "comment": comment,
            "data": "\n".join(lines)
        })

    def get_version(self, version_number):
        """Return the stored snapshot for version_number, or None if it does not exist."""
        for version in self.version_history:
            if version["version"] == version_number:
                return version
        return None

    def compare_versions(self, version1, version2):
        """Return a unified diff between two stored snapshots."""
        v1 = self.get_version(version1)
        v2 = self.get_version(version2)
        if v1 and v2:
            diff = difflib.unified_diff(
                v1["data"].splitlines(),
                v2["data"].splitlines(),
                fromfile=f"v{version1}",
                tofile=f"v{version2}",
                lineterm=""
            )
            return "\n".join(diff)
        return "Versions not found"
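
The change-tracking idea above can be tried without Owlready2. The sketch below is a minimal illustration, assuming each snapshot is N-Triples text with one triple per line; the `http://example.org/onto#` namespace and the helper names `class_triple`, `snapshot` and `changed_lines` are made up for the example.

```python
import difflib

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
OWL_CLASS = "<http://www.w3.org/2002/07/owl#Class>"

def class_triple(name):
    """Build the N-Triples line declaring a class in a hypothetical namespace."""
    return f"<http://example.org/onto#{name}> {RDF_TYPE} {OWL_CLASS} ."

def snapshot(class_names):
    """Serialize to sorted lines so ordering differences do not show up as changes."""
    return "\n".join(sorted(class_triple(name) for name in class_names))

def changed_lines(old, new):
    """Return only added or removed triples, skipping diff headers and context lines."""
    diff = difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="")
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]

v1 = snapshot(["Person", "Employee"])
v2 = snapshot(["Employee", "Manager", "Person"])  # same classes plus Manager
for line in changed_lines(v1, v2):
    print(line)
```

Because both snapshots are sorted, the only reported change is the added `Manager` triple, even though the classes were listed in a different order.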
  3. Develop tools for ontology alignment and merging:

a) Create an OntologyAligner class:

import difflib
import types

from owlready2 import Thing, get_ontology

class OntologyAligner:
    def __init__(self, onto1, onto2):
        self.onto1 = onto1
        self.onto2 = onto2
        self.alignments = []

    def align_classes(self, threshold=0.8):
        """Record every (cls1, cls2, similarity) pair scoring at or above the threshold."""
        for cls1 in self.onto1.classes():
            for cls2 in self.onto2.classes():
                similarity = self._calculate_similarity(cls1, cls2)
                if similarity >= threshold:
                    self.alignments.append((cls1, cls2, similarity))

    def _calculate_similarity(self, cls1, cls2):
        # Implement a similarity measure (e.g., string similarity, structural similarity)
        # This is a placeholder implementation based on class names only
        return difflib.SequenceMatcher(None, cls1.name, cls2.name).ratio()

    def merge_ontologies(self, output_iri):
        merged_onto = get_ontology(output_iri)
        aligned1 = {cls1 for cls1, _, _ in self.alignments}
        aligned2 = {cls2 for _, cls2, _ in self.alignments}
        with merged_onto:
            # Aligned pairs become one class, declared equivalent to its counterpart
            for cls1, cls2, _ in self.alignments:
                merged_class = types.new_class(cls1.name, (Thing,))
                merged_class.equivalent_to.append(cls2)

            # Copy remaining classes from both ontologies
            # (names only; hierarchy and properties are not carried over)
            for cls in set(self.onto1.classes()) - aligned1:
                types.new_class(cls.name, (Thing,))
            for cls in set(self.onto2.classes()) - aligned2:
                types.new_class(cls.name, (Thing,))

        return merged_onto
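
To get a feel for what the placeholder similarity measure and the 0.8 threshold accept, here is a small standalone sketch that runs the same `difflib.SequenceMatcher` comparison on plain class names. The name lists and the `align_names` helper are hypothetical.

```python
import difflib

def name_similarity(a, b):
    """Ratio in [0, 1]; 1.0 means identical strings. Case-sensitive, like the placeholder."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def align_names(names1, names2, threshold=0.8):
    """Return (name1, name2, score) for every pair at or above the threshold."""
    pairs = []
    for a in names1:
        for b in names2:
            score = name_similarity(a, b)
            if score >= threshold:
                pairs.append((a, b, round(score, 3)))
    return pairs

onto1_names = ["Person", "Employee", "Company"]
onto2_names = ["Persons", "Employees", "Organization"]
for pair in align_names(onto1_names, onto2_names):
    print(pair)
```

Only the near-identical spellings (`Person`/`Persons`, `Employee`/`Employees`) pass. `Company` and `Organization` are not matched, which shows why a pure string measure would likely need to be complemented by structural or embedding-based similarity.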

To use these advanced ontology management tools with TxtAI:

# Create a versioned ontology
# (the IRI must resolve to a loadable ontology, or a local copy must be on owlready2's onto_path)
vo = VersionedOntology("http://example.org/my_ontology")

# Add classes and properties
vo.add_class("Person")
vo.add_class("Employee", ["Person"])
vo.add_class("Company")  # must exist before it is used as a property range
vo.add_property("works_for", "Employee", "Company")

# Save a version
vo.save_version("Initial version")

# Make changes
vo.add_class("Manager", ["Employee"])
vo.save_version("Added Manager class")

# Compare versions
diff = vo.compare_versions(1, 2)
print(diff)

# Align and merge ontologies
another_onto = get_ontology("http://example.org/another_ontology").load()
aligner = OntologyAligner(vo.onto, another_onto)
aligner.align_classes()
merged_onto = aligner.merge_ontologies("http://example.org/merged_ontology")

# Use the merged ontology in TxtAI
# (TxtAIGraph and load_from_owlready are part of this proposal, not an existing API)
txtai_graph = TxtAIGraph()
txtai_graph.load_from_owlready(merged_onto)

This implementation is a starting point for advanced ontology management within TxtAI: Owlready2 handles ontology manipulation, NetworkX handles graph operations, and custom classes cover versioning, alignment, and merging. It relies only on open-source libraries. TxtAIGraph and load_from_owlready are placeholders for the integration points this request proposes.


nicolas-geysse commented 5 days ago

Adding import/export capabilities for various formats, including JSON, would indeed be a valuable addition to the advanced ontology management feature. Let's integrate this functionality into our existing implementation:

import io
import json
import types

import owlready2
from rdflib import Graph

class AdvancedOntologyManager(EnhancedOntologyGraph):
    def __init__(self, ontology_iri):
        super().__init__(ontology_iri)

    def export_to_format(self, format='turtle', file_path=None):
        """
        Export the ontology to various formats.
        Supported formats: 'turtle', 'xml', 'n3', 'nt', 'json-ld', 'json'
        Returns the serialized ontology as a string and writes it to file_path if given.
        """
        if format == 'json':
            # Custom JSON export
            data = json.dumps(self._ontology_to_json(), indent=2)
        else:
            # Use rdflib's serialization for other formats
            # Note: this serializes the whole owlready2 world, not only this ontology
            data = self.onto.world.as_rdflib_graph().serialize(format=format)
            if isinstance(data, bytes):  # older rdflib versions return bytes
                data = data.decode('utf-8')
        if file_path:
            with open(file_path, 'w', encoding='utf-8') as f:
                f.write(data)
        return data

    def import_from_format(self, file_path, format='turtle'):
        """
        Import ontology from various formats.
        Supported formats: 'turtle', 'xml', 'n3', 'nt', 'json-ld', 'json'
        """
        if format == 'json':
            # Custom JSON import
            with open(file_path, 'r', encoding='utf-8') as f:
                json_data = json.load(f)
            self._json_to_ontology(json_data)
        else:
            # Parse with rdflib, then hand the triples to owlready2 as RDF/XML,
            # a format owlready2 loads natively (owlready2 has no URIRef/Literal/world.add API)
            g = Graph()
            g.parse(file_path, format=format)
            rdfxml = g.serialize(format='xml')
            if isinstance(rdfxml, str):
                rdfxml = rdfxml.encode('utf-8')
            self.onto = owlready2.get_ontology("http://temp.org/onto.owl")
            self.onto.load(fileobj=io.BytesIO(rdfxml))
        # Keep the NetworkX mirror in sync with the imported ontology
        self.graph = self._build_networkx_graph()

    def _ontology_to_json(self):
        """Convert ontology to a JSON-serializable dictionary"""
        json_data = {
            "classes": [],
            "properties": [],
            "individuals": []
        }

        for cls in self.onto.classes():
            json_data["classes"].append({
                "name": cls.name,
                "parents": [p.name for p in cls.is_a if isinstance(p, owlready2.ThingClass)]
            })

        for prop in self.onto.properties():
            json_data["properties"].append({
                "name": prop.name,
                "domain": [d.name for d in prop.domain],
                "range": [r.name for r in prop.range]
            })

        for ind in self.onto.individuals():
            json_data["individuals"].append({
                "name": ind.name,
                "type": ind.is_a[0].name if ind.is_a else None
            })

        return json_data

    def _json_to_ontology(self, json_data):
        """Convert JSON data to ontology"""
        with self.onto:
            # First pass: create every class so parents can be resolved in any order
            for cls_data in json_data["classes"]:
                types.new_class(cls_data["name"], (owlready2.Thing,))

            # Second pass: attach parents that exist in this ontology
            for cls_data in json_data["classes"]:
                cls = self.onto[cls_data["name"]]
                for parent_name in cls_data["parents"]:
                    parent = self.onto[parent_name]
                    if parent is not None and parent not in cls.is_a:
                        cls.is_a.append(parent)

            for prop_data in json_data["properties"]:
                prop = types.new_class(prop_data["name"], (owlready2.ObjectProperty,))
                prop.domain = [self.onto[d] for d in prop_data["domain"] if self.onto[d] is not None]
                prop.range = [self.onto[r] for r in prop_data["range"] if self.onto[r] is not None]

            for ind_data in json_data["individuals"]:
                # Individuals exported without a type are skipped
                cls = self.onto[ind_data["type"]] if ind_data["type"] else None
                if cls is not None:
                    cls(ind_data["name"])

# Usage example
# (the IRI must resolve to a loadable ontology, or a local copy must be on owlready2's onto_path)
manager = AdvancedOntologyManager("http://example.org/my_ontology")

# Export to various formats
manager.export_to_format(format='turtle', file_path='ontology.ttl')
manager.export_to_format(format='xml', file_path='ontology.owl')
manager.export_to_format(format='json', file_path='ontology.json')

# Import from various formats
manager.import_from_format('ontology.ttl', format='turtle')
manager.import_from_format('ontology.owl', format='xml')
manager.import_from_format('ontology.json', format='json')

This implementation adds the following features:

  1. Export to various formats, including a custom JSON format.
  2. Import from various formats, including the custom JSON format.
  3. Support for common RDF formats (Turtle, RDF/XML, N3, N-Triples, JSON-LD) using rdflib's serialization and parsing capabilities. JSON-LD is built into rdflib 6.0 and later; earlier versions need the rdflib-jsonld plugin.
  4. A custom JSON format that captures classes, properties, and individuals in a more human-readable structure.

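This approach allows integration with systems and tools that require specific formats. The custom JSON format is a simpler representation of the ontology structure, useful for non-RDF-aware systems or for manipulation in Python. It is lossy: only class names and parents, property domains and ranges, and each individual's first type are kept, so restrictions, annotations, and property values do not survive a JSON round trip.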
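
As an illustration of the custom JSON layout, the sketch below builds a small document by hand, round-trips it through the standard `json` module, and checks that every referenced class name is defined. The sample data and the `dangling_references` helper are hypothetical; only the three top-level keys come from the code above.

```python
import json

# Hypothetical ontology in the custom JSON layout (classes, properties, individuals)
ontology = {
    "classes": [
        {"name": "Person", "parents": ["Thing"]},
        {"name": "Employee", "parents": ["Person"]},
        {"name": "Company", "parents": ["Thing"]},
    ],
    "properties": [
        {"name": "works_for", "domain": ["Employee"], "range": ["Company"]},
    ],
    "individuals": [
        {"name": "alice", "type": "Employee"},
    ],
}

def dangling_references(data):
    """List names used as parent, domain, range or type that no class entry defines."""
    known = {cls["name"] for cls in data["classes"]} | {"Thing"}
    used = set()
    for cls in data["classes"]:
        used.update(cls["parents"])
    for prop in data["properties"]:
        used.update(prop["domain"])
        used.update(prop["range"])
    for ind in data["individuals"]:
        if ind["type"]:
            used.add(ind["type"])
    return sorted(used - known)

text = json.dumps(ontology, indent=2)
restored = json.loads(text)
print(restored == ontology)           # True
print(dangling_references(restored))  # []
```

A check like this could run before `_json_to_ontology`, since that method silently skips parents, domains, and ranges it cannot resolve.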

By incorporating these import/export capabilities directly into the AdvancedOntologyManager, we maintain a cohesive interface for ontology management while providing flexibility in data exchange formats. This addition enhances the interoperability of the ontology management system with various external tools and workflows.
