neuml / txtai

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
https://neuml.github.io/txtai
Apache License 2.0

Feature Request: Graph RAG - Add extra attributes #684

Open andrelima666 opened 6 months ago

andrelima666 commented 6 months ago

Hello,

While trying to build this into a Graph RAG workflow with an LLM, I wanted to run queries based on node type (for example, a node is a 'person', a 'skill', ...). The idea is to have person A, identified by node A1, and person B, identified by node B1, plus skills such as soccer and swimming (S1 and S2). That gives 4 nodes in the graph at this point.

If persons A1 and B1 share a skill, the query should return those vertices with the skill connecting them. But when I run the following query, it returns an empty list: MATCH P=(N)-[*1..2]->(D) WHERE N.type == 'person' RETURN P

Wasn't this supposed to return at least the person and their skills? I'm assuming the issue is that the graph has no 'type' attribute.

Main point/question: is it possible to add additional attributes? So far I have only seen the attributes [id, text, topic, topicrank] being set. How can a new attribute, like 'type', be added and persisted?
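To illustrate the suspicion above, here is a minimal, library-free sketch of why a type filter comes back empty when nodes carry no 'type' attribute. The `nodes` dict and the `match_by_type` helper are hypothetical stand-ins for illustration, not txtai or grand-cypher APIs.

```python
# Hypothetical stand-in for graph node storage: node id -> attribute dict.
nodes = {
    "A1": {"text": "Person A"},
    "B1": {"text": "Person B"},
    "S1": {"text": "soccer"},
    "S2": {"text": "swimming"},
}

def match_by_type(nodes, node_type):
    """Return ids of nodes whose 'type' attribute equals node_type."""
    return [uid for uid, attrs in nodes.items() if attrs.get("type") == node_type]

# No node has a 'type' attribute yet, so the filter matches nothing.
before = match_by_type(nodes, "person")

# Once the attribute is stored on the nodes, the same filter finds the people.
nodes["A1"]["type"] = "person"
nodes["B1"]["type"] = "person"
nodes["S1"]["type"] = "skill"
nodes["S2"]["type"] = "skill"
after = match_by_type(nodes, "person")

print(before, after)
```

The filter itself is unchanged between the two calls. Only the presence of the attribute on the nodes differs.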

davidmezzetti commented 6 months ago

Thank you for this write up.

I debated over whether to duplicate the attribute data between databases and graphs. With 7.0, I decided to only sync text data with the graph component.

But I agree that it would be nice to have access to all attributes to handle scenarios such as what is mentioned above. With that, I'll modify the logic to sync attributes.

I'll think about whether attribute syncing should be always on, on by default with an option to disable it, or off by default with an option to enable it.
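A rough sketch of how such a setting could behave, not actual txtai code. The `select_attributes` function and its `policy` argument are hypothetical names chosen for illustration. The text is always kept, since text is what is synced with the graph today.

```python
def select_attributes(row, policy):
    """
    Pick which attributes of an input row would be copied to its graph node.

    policy is a hypothetical setting:
      - False or None: copy only the text
      - True: copy every attribute
      - list of names: copy the text plus only those attributes
    """
    if policy is False or policy is None:
        keep = {"text"}
    elif policy is True:
        keep = set(row)
    else:
        keep = {"text", *policy}
    return {key: value for key, value in row.items() if key in keep}

row = {"text": "Person A", "type": "person", "age": 30}

text_only = select_attributes(row, False)
everything = select_attributes(row, True)
only_type = select_attributes(row, ["type"])
```

Accepting either a boolean or a list of names would cover all three options with one setting: the default value decides whether syncing is opt-in or opt-out.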

nicolas-geysse commented 3 months ago

Hello! After some research, might this approach be interesting?

This is a detailed sequential proposal to integrate typing and advanced querying capabilities into TxtAI, leveraging NetworkX and gradually adding RDF/SPARQL features:

Title: Progressive Extension of Graph Capabilities in TxtAI with Typing and RDF/SPARQL Support

Phase 1: Adding Basic Typing

1. Extend the TxtAI Graph class:

from txtai.graph import Graph as TxtAIGraph
import networkx as nx

class TypedGraph(TxtAIGraph):
    def add_typed_node(self, node, node_type, **attr):
        self.graph.add_node(node, type=node_type, **attr)

    def add_typed_edge(self, u, v, edge_type, **attr):
        self.graph.add_edge(u, v, type=edge_type, **attr)

    def get_nodes_by_type(self, node_type):
        return [node for node, data in self.graph.nodes(data=True) if data.get('type') == node_type]

    def get_edges_by_type(self, edge_type):
        return [(u, v) for u, v, data in self.graph.edges(data=True) if data.get('type') == edge_type]

Usage example:

G = TypedGraph()
G.add_typed_node(1, "person", name="Alice")
G.add_typed_node(2, "person", name="Bob")
G.add_typed_node(3, "skill", name="Python")
G.add_typed_edge(1, 3, "has_skill")
G.add_typed_edge(2, 3, "has_skill")

people = G.get_nodes_by_type("person")
skills = G.get_nodes_by_type("skill")

Phase 2: Integrating RDFLib-NetworkX

2. Add RDFLib-NetworkX as a dependency and extend the TypedGraph class:

from rdflib_networkx import network_to_rdflib, rdflib_to_network

class RDFTypedGraph(TypedGraph):
    def to_rdf(self):
        return network_to_rdflib(self.graph)

    def from_rdf(self, rdf_graph):
        self.graph = rdflib_to_network(rdf_graph)

    def load_rdf(self, file_path, format='turtle'):
        import rdflib
        g = rdflib.Graph()
        g.parse(file_path, format=format)
        self.from_rdf(g)

Usage example:

G = RDFTypedGraph()
G.load_rdf("data.ttl")
rdf_graph = G.to_rdf()
rdf_graph.serialize(destination='output.ttl', format='turtle')
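For intuition about what an RDF conversion has to do, here is a library-free sketch that flattens typed nodes and edges into (subject, predicate, object) tuples. It does not use RDFLib. The `to_triples` helper and the "rdf:type" string label are assumptions made for illustration only.

```python
def to_triples(nodes, edges):
    """Flatten typed nodes and edges into (subject, predicate, object) tuples."""
    triples = []
    for node, attrs in nodes.items():
        for key, value in attrs.items():
            # The node type becomes an rdf:type-like statement; other attributes stay as properties.
            predicate = "rdf:type" if key == "type" else key
            triples.append((node, predicate, value))
    for source, target, edge_type in edges:
        # Each typed edge becomes one statement with the edge type as predicate.
        triples.append((source, edge_type, target))
    return triples

nodes = {
    "A1": {"type": "person", "name": "Alice"},
    "S1": {"type": "skill", "name": "Python"},
}
edges = [("A1", "S1", "has_skill")]

triples = to_triples(nodes, edges)

# A type lookup is then a simple pattern match over the triples.
people = [s for s, p, o in triples if p == "rdf:type" and o == "person"]
```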

Phase 3: Adding Basic SPARQL Support

3. Implement a simple SPARQL execution method:

from rdflib.plugins.sparql import prepareQuery

class SPARQLGraph(RDFTypedGraph):
    def execute_sparql(self, query_string):
        rdf_graph = self.to_rdf()
        query = prepareQuery(query_string)
        results = rdf_graph.query(query)
        return list(results)

Usage example:

G = SPARQLGraph()
G.load_rdf("data.ttl")

results = G.execute_sparql("""
    SELECT ?person ?skill
    WHERE {
        ?person a :Person ;
                :hasSkill ?skill .
    }
""")

for row in results:
    print(f"Person: {row.person}, Skill: {row.skill}")

Phase 4: Integration with Existing TxtAI Features

4. Ensure compatibility with TxtAI's semantic search methods:

from txtai.embeddings import Embeddings

class EnhancedGraph(SPARQLGraph):
    def __init__(self):
        super().__init__()
        self.embeddings = Embeddings()

    def semantic_subgraph(self, query, limit=5):
        similar = self.embeddings.search(query, limit)
        # Wrap the matching nodes in a new EnhancedGraph (copy detaches the subgraph view)
        result = EnhancedGraph()
        result.graph = self.graph.subgraph([node for node, _ in similar]).copy()
        return result

    def sparql_with_embedding(self, query, sparql_template):
        similar = self.embeddings.search(query, 1)[0][0]
        sparql_query = sparql_template.format(entity=similar)
        return self.execute_sparql(sparql_query)

Final usage example:

G = EnhancedGraph()
G.load_rdf("knowledge_base.ttl")

# Semantic search + subgraph
subgraph = G.semantic_subgraph("machine learning")

# SPARQL query with embedding
results = G.sparql_with_embedding("AI techniques", """
    SELECT ?related_concept
    WHERE {{
        <{entity}> :relatedTo ?related_concept .
    }}
""")

for row in results:
    print(f"Related concept: {row.related_concept}")

This progressive approach allows for the addition of typing, RDF support, and SPARQL querying while maintaining compatibility with TxtAI's existing NetworkX-based infrastructure. It provides a smooth transition to more advanced knowledge graph capabilities while preserving the integration with TxtAI's semantic search features.

Practical Usage Example:

Once this class is implemented, here is how it could be used to solve the initial problem:

# Creating the graph
G = TypedGraph()

# Adding people
G.add_typed_node('A1', 'person', name='Person A')
G.add_typed_node('B1', 'person', name='Person B')

# Adding skills
G.add_typed_node('S1', 'skill', name='Soccer')
G.add_typed_node('S2', 'skill', name='Swimming')

# Adding relationships
G.add_typed_edge('A1', 'S1', 'has_skill')
G.add_typed_edge('A1', 'S2', 'has_skill')
G.add_typed_edge('B1', 'S1', 'has_skill')

# Searching for people
persons = G.get_nodes_by_type('person')
print("People:", persons)

# Searching for shared skills
skills = G.get_nodes_by_type('skill')
for skill in skills:
    persons_with_skill = [n for n in G.graph.neighbors(skill) if G.graph.nodes[n]['type'] == 'person']
    if len(persons_with_skill) > 1:
        print(f"Skill {G.graph.nodes[skill]['name']} shared by: {persons_with_skill}")

This approach solves the initial problem by allowing custom attributes (such as type) to be added to nodes and edges and then used in queries.
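One caveat with the neighbor-based lookup: on a directed NetworkX graph, neighbors() only follows outgoing edges, so a skill node with only incoming has_skill edges would report no people. A direction-independent sketch in plain Python groups the edges by skill instead. The `node_types` and `edges` structures and the `shared_skills` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical typed nodes and directed has_skill edges (person -> skill).
node_types = {"A1": "person", "B1": "person", "S1": "skill", "S2": "skill"}
edges = [("A1", "S1"), ("A1", "S2"), ("B1", "S1")]

def shared_skills(node_types, edges):
    """Map each skill to the people pointing at it, keeping skills with 2+ people."""
    people_by_skill = defaultdict(list)
    for source, target in edges:
        if node_types.get(source) == "person" and node_types.get(target) == "skill":
            people_by_skill[target].append(source)
    return {skill: people for skill, people in people_by_skill.items() if len(people) > 1}

result = shared_skills(node_types, edges)
print(result)
```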

nicolas-geysse commented 3 months ago

Owlready2 is missing from the provided implementation. To make the example more complete and leverage the capabilities of Owlready2, we can modify and extend the implementation. Here's a step-by-step example incorporating Owlready2: (It's Sonnet 3.5 speaking, but directed by me ;) )

import types  # needed for types.new_class below

from txtai.graph import Graph as TxtAIGraph
import networkx as nx
from owlready2 import *
from rdflib_networkx import network_to_rdflib, rdflib_to_network
from rdflib.plugins.sparql import prepareQuery
from txtai.embeddings import Embeddings

class EnhancedGraph(TxtAIGraph):
    def __init__(self):
        super().__init__()
        self.onto = get_ontology("http://test.org/onto.owl")
        self.onto.metadata.declare()
        self.embeddings = Embeddings()

    def add_typed_node(self, node, node_type, **attr):
        with self.onto:
            if not self.onto[node_type]:
                types.new_class(node_type, (Thing,))
            new_individual = self.onto[node_type](node)
            for key, value in attr.items():
                setattr(new_individual, key, value)
        self.graph.add_node(node, type=node_type, **attr)

    def add_typed_edge(self, u, v, edge_type, **attr):
        with self.onto:
            if not self.onto[edge_type]:
                types.new_class(edge_type, (ObjectProperty,))
            self.onto[edge_type](self.onto[u], self.onto[v])
        self.graph.add_edge(u, v, type=edge_type, **attr)

    def get_nodes_by_type(self, node_type):
        return list(self.onto[node_type].instances())

    def get_edges_by_type(self, edge_type):
        return [(u, v) for u, v, data in self.graph.edges(data=True) if data.get('type') == edge_type]

    def to_rdf(self):
        return network_to_rdflib(self.graph)

    def from_rdf(self, rdf_graph):
        self.graph = rdflib_to_network(rdf_graph)

    def load_rdf(self, file_path, format='turtle'):
        self.onto = get_ontology(file_path).load()
        for cls in self.onto.classes():
            for instance in cls.instances():
                self.add_typed_node(instance.name, cls.name)
        for prop in self.onto.object_properties():
            # Use obj rather than object to avoid shadowing the builtin
            for subject, obj in prop.get_relations():
                self.add_typed_edge(subject.name, obj.name, prop.name)

    def execute_sparql(self, query_string):
        rdf_graph = self.to_rdf()
        query = prepareQuery(query_string)
        results = rdf_graph.query(query)
        return list(results)

    def semantic_subgraph(self, query, limit=5):
        similar = self.embeddings.search(query, limit)
        # Wrap the matching nodes in a new EnhancedGraph (copy detaches the subgraph view)
        result = EnhancedGraph()
        result.graph = self.graph.subgraph([node for node, _ in similar]).copy()
        return result

    def sparql_with_embedding(self, query, sparql_template):
        similar = self.embeddings.search(query, 1)[0][0]
        sparql_query = sparql_template.format(entity=similar)
        return self.execute_sparql(sparql_query)

    def reason(self):
        with self.onto:
            sync_reasoner()

# Usage example
G = EnhancedGraph()

# Adding people
G.add_typed_node('A1', 'Person', name='Person A')
G.add_typed_node('B1', 'Person', name='Person B')

# Adding skills
G.add_typed_node('S1', 'Skill', name='Soccer')
G.add_typed_node('S2', 'Skill', name='Swimming')

# Adding relationships
G.add_typed_edge('A1', 'S1', 'hasSkill')
G.add_typed_edge('A1', 'S2', 'hasSkill')
G.add_typed_edge('B1', 'S1', 'hasSkill')

# Searching for people
persons = G.get_nodes_by_type('Person')
print("People:", [p.name for p in persons])

# Searching for shared skills
skills = G.get_nodes_by_type('Skill')
for skill in skills:
    # Owlready2 exposes the inverse of a property through the INVERSE_ prefix
    persons_with_skill = [n.name for n in skill.INVERSE_hasSkill]
    if len(persons_with_skill) > 1:
        print(f"Skill {skill.name} shared by: {persons_with_skill}")

# Using SPARQL
results = G.execute_sparql("""
    SELECT ?person ?skill
    WHERE {
        ?person a :Person ;
                :hasSkill ?skill .
    }
""")

for row in results:
    print(f"Person: {row.person}, Skill: {row.skill}")

# Using semantic search
subgraph = G.semantic_subgraph("sports")

# Using reasoning
G.reason()

This implementation incorporates Owlready2, allowing for more advanced ontology manipulation and reasoning. It combines the strengths of NetworkX, RDFLib, and Owlready2, providing a powerful toolkit for working with typed graphs, RDF data, and ontologies within the TxtAI framework.


nicolas-geysse commented 3 months ago

Another approach: Here's a detailed development of the point "Optimization of existing semantic graphs" with a focus on attribute synchronization and basic JSON import/export:

  1. Attribute Synchronization Implementation:

• Use NetworkX's built-in attribute handling:

import networkx as nx

def sync_attributes(G, attributes):
    nx.set_node_attributes(G, attributes)

# Usage
G = nx.Graph()
G.add_nodes_from([1, 2])  # set_node_attributes ignores nodes that are not in the graph
attributes = {1: {'type': 'person'}, 2: {'type': 'skill'}}
sync_attributes(G, attributes)

• Extend TxtAI's Graph class:

from txtai.graph import Graph

class EnhancedGraph(Graph):
    def sync_attributes(self, attributes):
        nx.set_node_attributes(self.graph, attributes)

    def get_node_attributes(self, attribute):
        return nx.get_node_attributes(self.graph, attribute)

  2. Basic JSON Import/Export:

• Utilize NetworkX's JSON functionality:

import json
import networkx as nx

def export_to_json(G, filename):
    data = nx.node_link_data(G)
    with open(filename, 'w') as f:
        json.dump(data, f)

def import_from_json(filename):
    with open(filename, 'r') as f:
        data = json.load(f)
    return nx.node_link_graph(data)

# Extend TxtAI's Graph class
class EnhancedGraph(Graph):
    def to_json(self, filename):
        export_to_json(self.graph, filename)

    @classmethod
    def from_json(cls, filename):
        G = import_from_json(filename)
        enhanced_graph = cls()
        enhanced_graph.graph = G
        return enhanced_graph

  3. Integration with TxtAI:

• Modify TxtAI's Graph class to include these new methods:

import json

import networkx as nx
from txtai.graph import Graph

class EnhancedGraph(Graph):
    def sync_attributes(self, attributes):
        nx.set_node_attributes(self.graph, attributes)

    def get_node_attributes(self, attribute):
        return nx.get_node_attributes(self.graph, attribute)

    def to_json(self, filename):
        data = nx.node_link_data(self.graph)
        with open(filename, 'w') as f:
            json.dump(data, f)

    @classmethod
    def from_json(cls, filename):
        with open(filename, 'r') as f:
            data = json.load(f)
        G = nx.node_link_graph(data)
        enhanced_graph = cls()
        enhanced_graph.graph = G
        return enhanced_graph

Benefits:

  1. Resolves the initial problem of missing attributes by providing methods to synchronize and retrieve node attributes.
  2. Improves interoperability with other systems through JSON import/export functionality.
  3. Maintains simplicity and integration with TxtAI's ecosystem by extending the existing Graph class.
  4. Utilizes NetworkX's built-in functions for efficiency and compatibility.

This implementation provides a straightforward way to handle attribute synchronization and basic JSON import/export within the TxtAI framework, addressing the initial attribute problem while enhancing interoperability with external systems.
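To show why a node-link style export preserves attributes such as type, here is a library-free sketch of the round trip using only the json module. It mimics a node-link layout (a list of nodes and a list of links). The key names and both helpers are assumptions for illustration and may not match what NetworkX emits in every version.

```python
import json

def to_node_link(nodes, edges):
    """Build a node-link style dict: attributes are inlined next to each id."""
    return {
        "nodes": [{"id": uid, **attrs} for uid, attrs in nodes.items()],
        "links": [{"source": s, "target": t, **attrs} for s, t, attrs in edges],
    }

def from_node_link(data):
    """Invert to_node_link, returning (nodes, edges)."""
    nodes = {n["id"]: {k: v for k, v in n.items() if k != "id"} for n in data["nodes"]}
    edges = [
        (e["source"], e["target"], {k: v for k, v in e.items() if k not in ("source", "target")})
        for e in data["links"]
    ]
    return nodes, edges

nodes = {
    "A1": {"type": "person", "name": "Person A"},
    "S1": {"type": "skill", "name": "Soccer"},
}
edges = [("A1", "S1", {"type": "has_skill"})]

# Serialize to a JSON string and load it back.
payload = json.dumps(to_node_link(nodes, edges))
restored_nodes, restored_edges = from_node_link(json.loads(payload))
```

Because every attribute travels inside the node or link record, anything JSON can represent survives the round trip.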


Integration of Cypher for Graph Queries

• Addition: Implementation of queries based on types and attributes
• Libraries: grand-cypher
• Benefits: Ability to perform complex queries on graphs

Implementation:

  1. Install the required library:

    pip install grand-cypher

  2. Extend the TxtAI Graph class to include Cypher querying capabilities:

from txtai.graph import Graph as TxtAIGraph
from grandcypher import GrandCypher
import networkx as nx

class CypherEnabledGraph(TxtAIGraph):
    def __init__(self):
        super().__init__()
        self.cypher = GrandCypher(self.graph)

    def cypher_query(self, query):
        return self.cypher.run(query)

    def add_typed_node(self, node, node_type, **attr):
        attr['type'] = node_type
        self.graph.add_node(node, **attr)

    def add_typed_edge(self, u, v, edge_type, **attr):
        attr['type'] = edge_type
        self.graph.add_edge(u, v, **attr)

  3. Implement methods to perform type-based queries (as additional methods of CypherEnabledGraph):

    def get_nodes_by_type(self, node_type):
        query = f"""
        MATCH (n)
        WHERE n.type = '{node_type}'
        RETURN n
        """
        return self.cypher_query(query)

    def get_edges_by_type(self, edge_type):
        query = f"""
        MATCH ()-[r]->()
        WHERE r.type = '{edge_type}'
        RETURN r
        """
        return self.cypher_query(query)

    def find_connected_nodes(self, start_node, relationship_type, end_node_type):
        query = f"""
        MATCH (start {{id: '{start_node}'}})-[r:{relationship_type}]->(end {{type: '{end_node_type}'}})
        RETURN end
        """
        return self.cypher_query(query)

Usage example:

graph = CypherEnabledGraph()

# Adding nodes and edges with types
graph.add_typed_node('A1', 'person', name='Person A')
graph.add_typed_node('B1', 'person', name='Person B')
graph.add_typed_node('S1', 'skill', name='Python')
graph.add_typed_edge('A1', 'S1', 'has_skill')
graph.add_typed_edge('B1', 'S1', 'has_skill')

# Performing type-based queries
persons = graph.get_nodes_by_type('person')
skills = graph.get_nodes_by_type('skill')
has_skill_edges = graph.get_edges_by_type('has_skill')

# Finding connected nodes
python_skilled_persons = graph.find_connected_nodes('S1', 'has_skill', 'person')

print("Persons:", persons)
print("Skills:", skills)
print("Has Skill Edges:", has_skill_edges)
print("Persons with Python skill:", python_skilled_persons)

This implementation solves the initial type problem by:

  1. Allowing the addition of typed nodes and edges
  2. Providing methods to query the graph based on types
  3. Enabling complex Cypher queries that can leverage both the graph structure and node/edge attributes

The integration with grand-cypher allows for more expressive and powerful queries while maintaining compatibility with TxtAI's existing graph structure. This approach provides a good balance between simplicity, integration with TxtAI's ecosystem, and the ability to perform complex graph queries.
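Since the query methods above interpolate the type name directly into the query string, it may be worth validating that name first. A minimal sketch, assuming type names are plain identifiers. The `nodes_by_type_query` helper and its validation rule are hypothetical and only build the string without running it.

```python
import re

# Assumed rule: type names are plain identifiers (letters, digits, underscore).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def nodes_by_type_query(node_type):
    """Return a Cypher query string for nodes of node_type, rejecting unsafe names."""
    if not _IDENTIFIER.fullmatch(node_type):
        raise ValueError(f"Unsupported type name: {node_type!r}")
    return f"MATCH (n) WHERE n.type == '{node_type}' RETURN n"

query = nodes_by_type_query("person")

# A value containing quotes is rejected instead of being spliced into the query.
try:
    nodes_by_type_query("person' OR '1' == '1")
    rejected = False
except ValueError:
    rejected = True
```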


Enhancing RAG Capabilities

• Addition: Utilization of attributes and types in the RAG process
• Libraries: transformers, networkx
• Category: LLM Integration for Knowledge Graph Enhancement
• Benefits: More precise text generation based on node attributes

Implementation:

  1. Extend TxtAI's Graph class to include attribute-aware embeddings:

import networkx as nx
import numpy as np  # used by get_subgraph_embedding
import torch
from transformers import AutoTokenizer, AutoModel
from txtai.graph import Graph

class AttributeAwareGraph(Graph):
    def __init__(self):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        self.model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

    def get_node_embedding(self, node):
        attributes = self.graph.nodes[node]
        text = f"Node: {node}, Type: {attributes.get('type', 'Unknown')}, " + \
               ", ".join([f"{k}: {v}" for k, v in attributes.items() if k != 'type'])

        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            outputs = self.model(**inputs)
        return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

    def get_subgraph_embedding(self, nodes):
        embeddings = [self.get_node_embedding(node) for node in nodes]
        return np.mean(embeddings, axis=0)

  2. Integrate attribute-aware embeddings into the RAG process:

from sklearn.metrics.pairwise import cosine_similarity
from txtai.pipeline import LLM

class EnhancedRAG:
    def __init__(self, graph, llm_model="gpt-3.5-turbo"):
        self.graph = graph
        self.llm = LLM(llm_model)  # pass the model path as the first positional argument

    def generate_response(self, query, k=3):
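        # Get query embedding: embed the raw query text with the graph's tokenizer and model
        # (get_node_embedding expects an existing node id, not a dict)
        inputs = self.graph.tokenizer(query, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            outputs = self.graph.model(**inputs)
        query_embedding = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()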

        # Find most similar nodes
        similarities = []
        for node in self.graph.graph.nodes():
            node_embedding = self.graph.get_node_embedding(node)
            similarity = cosine_similarity([query_embedding], [node_embedding])[0][0]
            similarities.append((node, similarity))

        top_nodes = sorted(similarities, key=lambda x: x[1], reverse=True)[:k]

        # Generate context from top nodes
        context = "\n".join([f"Node {node}: {dict(self.graph.graph.nodes[node])}" for node, _ in top_nodes])

        # Generate response using LLM
        prompt = f"Query: {query}\nContext:\n{context}\nResponse:"
        response = self.llm(prompt)

        return response

  3. Usage example:

graph = AttributeAwareGraph()
# Nodes and edges are added to the underlying NetworkX graph
graph.graph.add_node("A1", type="person", name="Alice", age=30)
graph.graph.add_node("B1", type="person", name="Bob", age=35)
graph.graph.add_node("S1", type="skill", name="Python")
graph.graph.add_edge("A1", "S1", type="has_skill")

rag = EnhancedRAG(graph)
response = rag.generate_response("Who knows Python?")
print(response)

This implementation enhances the RAG capabilities by:

  1. Incorporating node attributes and types into the embedding process.
  2. Using these attribute-aware embeddings to find relevant nodes for the RAG context.
  3. Providing a richer context to the LLM, including node attributes and types.

This approach indirectly addresses the initial type problem by making the RAG process more aware of node types and attributes. It allows for more precise text generation based on the structured information in the graph, including types and other attributes.

The implementation is well-integrated with TxtAI's ecosystem, extending its Graph class and using its LLM pipeline. It also leverages NetworkX for graph operations and the transformers library for generating embeddings, both of which are commonly used in the TxtAI ecosystem.
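The retrieval step can be illustrated without any model downloads. The sketch below serializes each node with its type and attributes, then ranks nodes against a query. A bag-of-words counter is used as a toy stand-in for the transformer embeddings, so the scores are only illustrative. All helper names are hypothetical.

```python
import math
from collections import Counter

def node_to_text(node, attributes):
    """Serialize a node id, its type and remaining attributes into one string."""
    rest = ", ".join(f"{k}: {v}" for k, v in attributes.items() if k != "type")
    return f"Node: {node}, Type: {attributes.get('type', 'Unknown')}, {rest}"

def bag_of_words(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(word.strip(",:?.").lower() for word in text.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

nodes = {
    "A1": {"type": "person", "name": "Alice", "age": 30},
    "S1": {"type": "skill", "name": "Python"},
}

query = bag_of_words("Who knows Python?")

# Rank node ids by similarity between the query and each serialized node.
ranked = sorted(
    nodes,
    key=lambda n: cosine(query, bag_of_words(node_to_text(n, nodes[n]))),
    reverse=True,
)
```

With this toy similarity the skill node ranks first because it shares the word "Python" with the query. Finding Alice would additionally require following the has_skill edge, which is where the graph structure comes in.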
