neuml / txtai

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
https://neuml.github.io/txtai
Apache License 2.0

Feature request: LLM Integration for Knowledge Graph Enhancement #741

Open nicolas-geysse opened 5 months ago

nicolas-geysse commented 5 months ago

Based on the requirements and the existing TxtAI ecosystem, here's a proposed approach to develop LLM Integration for Knowledge Graph Enhancement:

  1. Automatic Knowledge Graph Generation and Enrichment:
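# NOTE: TextToGraph is the pipeline proposed here; as far as I know it is not part of txtai's released pipelines
from txtai.pipeline import TextToGraph
from txtai.graph import Graph
import networkx as nx

class LLMEnhancedGraph(Graph):
    def __init__(self, config=None):
        # txtai's Graph base class takes a configuration dict
        super().__init__(config or {})
        # NetworkX graph holding the merged knowledge graph
        self.graph = nx.Graph()
        self.text_to_graph = TextToGraph()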

    def generate_from_llm(self, llm_output):
        # Convert LLM output to graph structure
        graph_data = self.text_to_graph(llm_output)

        # Add new nodes and edges to existing graph
        for node, data in graph_data.nodes(data=True):
            self.graph.add_node(node, **data)
        for u, v, data in graph_data.edges(data=True):
            self.graph.add_edge(u, v, **data)

    def enrich_existing_graph(self, llm_output):
        new_graph = self.text_to_graph(llm_output)
        self.graph = nx.compose(self.graph, new_graph)
  2. Validation and Integration Pipeline:
from txtai.embeddings import Embeddings

class ValidationPipeline:
    def __init__(self, graph, embeddings):
        self.graph = graph
        self.embeddings = embeddings

    def validate_and_integrate(self, new_nodes, threshold=0.8):
        for node, data in new_nodes:
            # Check for similar existing nodes
            similar = self.embeddings.search(node, 1)
            if similar and similar[0][1] > threshold:
                # Merge with existing node
                existing_node = similar[0][0]
                self.graph.graph.nodes[existing_node].update(data)
            else:
                # Add as new node
                self.graph.graph.add_node(node, **data)
  3. Feedback Mechanism:
class FeedbackMechanism:
    def __init__(self, graph, embeddings):
        self.graph = graph
        self.embeddings = embeddings
        self.feedback_log = []

    def log_feedback(self, node, feedback):
        self.feedback_log.append((node, feedback))

    def apply_feedback(self):
        nodes = self.graph.graph.nodes
        for node, feedback in self.feedback_log:
            # Skip feedback for nodes that are no longer in the graph
            if node not in nodes:
                continue
            confidence = nodes[node].get('confidence', 1)
            if feedback == 'positive':
                # Increase confidence or weight of the node
                nodes[node]['confidence'] = confidence * 1.1
            elif feedback == 'negative':
                # Decrease confidence or weight of the node
                nodes[node]['confidence'] = confidence * 0.9
        # Clear the log so the same feedback is not applied again on the next call
        self.feedback_log.clear()

    def retrain_embeddings(self):
        # Extract text from graph nodes
        texts = [data.get('text', '') for _, data in self.graph.graph.nodes(data=True)]
        # Retrain embeddings with updated graph data
        self.embeddings.index(texts)
  4. Integration with TxtAI:
from txtai.pipeline import LLM

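class LLMGraphEnhancer:
    def __init__(self, graph, embeddings, llm_model="gpt-3.5-turbo"):
        # Reuse the caller's graph when it already supports LLM enrichment
        self.graph = graph if isinstance(graph, LLMEnhancedGraph) else LLMEnhancedGraph()
        self.validation = ValidationPipeline(self.graph, embeddings)
        self.feedback = FeedbackMechanism(self.graph, embeddings)
        # The LLM pipeline takes the model path as its first argument
        self.llm = LLM(llm_model)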

    def enhance_graph(self, query):
        # Generate new knowledge using LLM
        llm_output = self.llm(f"Generate knowledge graph for: {query}")

        # Generate and enrich graph
        self.graph.generate_from_llm(llm_output)

        # Validate and integrate new nodes (snapshot the node view so the
        # graph is not modified while it is being iterated)
        new_nodes = list(self.graph.graph.nodes(data=True))
        self.validation.validate_and_integrate(new_nodes)

        # Apply feedback and retrain embeddings
        self.feedback.apply_feedback()
        self.feedback.retrain_embeddings()

    def get_enhanced_graph(self):
        return self.graph.graph

This implementation:

  1. Uses TxtAI's existing TextToGraph pipeline for converting LLM outputs to graph structures.
  2. Leverages NetworkX for graph operations, which is already used by TxtAI.
  3. Utilizes TxtAI's Embeddings for similarity checks in the validation process.
  4. Implements a feedback mechanism that adjusts node confidence and retrains embeddings.
  5. Integrates with TxtAI's LLM pipeline for generating new knowledge.
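
The validation step (point 3 above) can be tried without txtai. This is a minimal sketch, assuming a toy bag-of-words `embed` function and a plain dict standing in for the graph; both are made up for illustration and are not txtai APIs. A candidate is merged into its most similar existing node when the similarity exceeds the threshold, otherwise it is added as a new node.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding (assumption): lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def validate_and_integrate(graph, new_nodes, threshold=0.8):
    """Merge each candidate into its most similar existing node, or add it as new."""
    for node, data in new_nodes:
        vector = embed(node)
        best, score = None, 0.0
        for existing in graph:
            s = cosine(vector, embed(existing))
            if s > score:
                best, score = existing, s
        if best is not None and score > threshold:
            graph[best].update(data)      # near-duplicate: merge attributes
        else:
            graph[node] = dict(data)      # sufficiently different: new node

graph = {"machine learning": {"source": "seed"}}
validate_and_integrate(graph, [
    ("Machine Learning", {"confidence": 0.9}),   # same tokens after lowercasing
    ("graph databases", {"confidence": 0.7}),    # no token overlap with existing nodes
])
print(sorted(graph))
```

With a real embeddings model the threshold would need tuning, since 0.8 on dense sentence vectors does not mean the same thing as 0.8 on word counts.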

To use this enhanced graph system:

from txtai.embeddings import Embeddings

embeddings = Embeddings()
enhancer = LLMGraphEnhancer(LLMEnhancedGraph(), embeddings)

enhancer.enhance_graph("Artificial Intelligence")
enhanced_graph = enhancer.get_enhanced_graph()

This approach enhances knowledge graphs with LLM outputs using components from the TxtAI ecosystem, and the feedback mechanism lets node confidence and embeddings be refined over time.
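
The feedback loop can be sketched in isolation as follows. The class name `FeedbackLog`, the default confidence of 0.5 and the clamping to the range [0, 1] are assumptions added for illustration; the proposal itself only specifies the multiplicative 1.1 / 0.9 adjustment. The sketch also clears the log after applying it, so calling `apply` twice does not count the same feedback twice.

```python
class FeedbackLog:
    """Collects feedback and applies it once to node confidence scores."""

    def __init__(self, nodes, step=0.1, low=0.0, high=1.0):
        self.nodes = nodes          # node id -> attribute dict
        self.step = step            # relative adjustment per feedback item
        self.low, self.high = low, high
        self.pending = []

    def log(self, node, positive):
        self.pending.append((node, positive))

    def apply(self):
        """Apply pending feedback, then clear it so it is not counted twice."""
        for node, positive in self.pending:
            if node not in self.nodes:
                continue            # node was removed or merged in the meantime
            current = self.nodes[node].get("confidence", 0.5)
            factor = 1 + self.step if positive else 1 - self.step
            self.nodes[node]["confidence"] = min(self.high, max(self.low, current * factor))
        self.pending.clear()

nodes = {"ai": {"confidence": 0.5}, "ml": {}}
log = FeedbackLog(nodes)
log.log("ai", positive=True)
log.log("ml", positive=False)
log.log("missing", positive=True)   # ignored: no such node
log.apply()
log.apply()                          # no-op because the log was cleared
print(nodes)
```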


nicolas-geysse commented 5 months ago

Implementing Direct Embedding Association in TxtAI:

Feature: Direct Embedding Association (Relates to: LLM Integration for Knowledge Graph Enhancement) (New Feature Tag)

  1. Implement a system to store embedding vectors directly with graph nodes:
import networkx as nx
from txtai.embeddings import Embeddings

class EnhancedGraph(nx.Graph):
    def __init__(self):
        super().__init__()
        self.embeddings = Embeddings()

    def add_node(self, node_for_adding, **attr):
        super().add_node(node_for_adding, **attr)
        if 'text' in attr:
            embedding = self.embeddings.transform(attr['text'])
            self.nodes[node_for_adding]['embedding'] = embedding

    def get_node_embedding(self, node):
        return self.nodes[node].get('embedding', None)
  2. Develop a mechanism to update embeddings efficiently when node content changes:
    def update_node_content(self, node, new_text):
        self.nodes[node]['text'] = new_text
        new_embedding = self.embeddings.transform(new_text)
        self.nodes[node]['embedding'] = new_embedding

    def update_affected_nodes(self, changed_node):
        changed_text = self.nodes[changed_node].get('text', '')
        for neighbor in self.neighbors(changed_node):
            # Re-embed each neighbor using the changed node's text as context
            neighbor_text = self.nodes[neighbor].get('text', '')
            context = f"{changed_text} {neighbor_text}".strip()
            new_embedding = self.embeddings.transform(context)
            self.nodes[neighbor]['embedding'] = new_embedding

Integration with TxtAI ecosystem: This implementation leverages TxtAI's Embeddings class for generating and transforming embeddings. It extends NetworkX's Graph class, which is already used in TxtAI, ensuring compatibility with existing graph operations.

Usage example:

graph = EnhancedGraph()
graph.add_node(1, text="Example node content")
embedding = graph.get_node_embedding(1)

graph.update_node_content(1, "Updated node content")
graph.update_affected_nodes(1)

This feature enhances the "LLM Integration for Knowledge Graph Enhancement" part of the roadmap by providing a direct and efficient way to associate embeddings with graph nodes. It allows for quick retrieval and update of embeddings, which is crucial for real-time graph updates and queries.
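
One way to keep updates cheap is to re-embed a node only when its text has actually changed. The sketch below assumes a hypothetical `EmbeddingStore` helper and a fake `embed` callable in place of a real model; neither exists in TxtAI. It stores a digest of the text next to each vector and skips the model call when the digest matches.

```python
import hashlib

class EmbeddingStore:
    """Caches one vector per node and recomputes it only when the text changes."""

    def __init__(self, embed):
        self.embed = embed          # callable: text -> vector (assumption)
        self.vectors = {}           # node id -> (text digest, vector)
        self.calls = 0              # number of times embed was invoked

    def get(self, node, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        cached = self.vectors.get(node)
        if cached and cached[0] == digest:
            return cached[1]        # text unchanged: reuse the stored vector
        self.calls += 1
        vector = self.embed(text)
        self.vectors[node] = (digest, vector)
        return vector

def fake_embed(text):
    """Stand-in for a real model: vector of (length, lowercase vowel count)."""
    return [len(text), sum(text.count(v) for v in "aeiou")]

store = EmbeddingStore(fake_embed)
store.get(1, "Example node content")
store.get(1, "Example node content")   # cache hit, no new embed call
store.get(1, "Updated node content")   # text changed, re-embedded
print(store.calls)
```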

The implementation builds on TxtAI's Embeddings class and uses NetworkX as the underlying graph library, so existing graph operations keep working on nodes that carry embeddings.


nicolas-geysse commented 5 months ago

Proposal for implementing Indexing Optimization with HNSW and hybrid indexing:

Feature: Advanced Indexing Optimization (Relates to: LLM Integration for Knowledge Graph Enhancement) (New Feature Tag)

  1. Implement HNSW for faster nearest neighbor search:
import hnswlib
from txtai.graph import Graph

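class HNSWGraph(Graph):
    def __init__(self, dim, max_elements, ef_construction=200, M=16):
        # txtai's Graph base class expects a configuration dict
        super().__init__({})
        self.index = hnswlib.Index(space='cosine', dim=dim)
        self.index.init_index(max_elements=max_elements, ef_construction=ef_construction, M=M)
        self.node_map = {}  # node id -> HNSW label
        self.labels = []    # HNSW label -> node id

    def add_node(self, node_id, embedding, **attr):
        # The txtai Graph base class has no add_node method, so only the
        # vector index is updated here; subclasses store node attributes
        if node_id in self.node_map:
            raise ValueError(f"Node {node_id!r} is already indexed")
        label = len(self.labels)
        self.node_map[node_id] = label
        self.labels.append(node_id)
        self.index.add_items(embedding.reshape(1, -1), [label])

    def nearest_neighbors(self, query_embedding, k=10):
        # hnswlib raises an error when k exceeds the number of indexed items
        k = min(k, len(self.labels))
        if k == 0:
            return []
        labels, _ = self.index.knn_query(query_embedding.reshape(1, -1), k=k)
        return [self.labels[label] for label in labels[0]]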
  2. Create a hybrid index combining graph structure and semantic embeddings:
import networkx as nx
from txtai.embeddings import Embeddings

class HybridGraph(HNSWGraph):
    def __init__(self, dim, max_elements, ef_construction=200, M=16):
        super().__init__(dim, max_elements, ef_construction, M)
        self.graph = nx.Graph()
        self.embeddings = Embeddings()

    def add_node(self, node_id, text, **attr):
        embedding = self.embeddings.transform(text)
        super().add_node(node_id, embedding, **attr)
        self.graph.add_node(node_id, text=text, **attr)

    def add_edge(self, u, v, **attr):
        self.graph.add_edge(u, v, **attr)

    def search(self, query, k=10):
        query_embedding = self.embeddings.transform(query)
        nn_nodes = self.nearest_neighbors(query_embedding, k)

        subgraph = self.graph.subgraph(nn_nodes)
        pagerank = nx.pagerank(subgraph)

        return sorted(pagerank.items(), key=lambda x: x[1], reverse=True)

This implementation integrates HNSW for fast nearest neighbor search and combines it with NetworkX for graph structure analysis. It relates to the "LLM Integration for Knowledge Graph Enhancement" feature in the roadmap, as it provides an efficient way to search and analyze the knowledge graph created from LLM outputs.

The HNSWGraph class wraps hnswlib's HNSW index for fast approximate nearest neighbor search, while the HybridGraph class extends it with graph structure analysis using NetworkX. The search method in HybridGraph first retrieves the k most similar nodes via HNSW, then ranks them by PageRank computed on the subgraph they induce, so results reflect both semantic similarity and graph structure.
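
The two-stage search can be illustrated with plain Python. This is only a sketch: brute-force cosine similarity stands in for the HNSW index, a small power-iteration routine stands in for NetworkX's PageRank, and the function names and toy vectors are invented for the example.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph restricted to `nodes`."""
    nodes = list(nodes)
    if not nodes:
        return {}
    neighbors = {n: set() for n in nodes}
    for u, v in edges:
        if u in neighbors and v in neighbors:
            neighbors[u].add(v)
            neighbors[v].add(u)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Nodes without neighbors spread their rank evenly over all nodes
        dangling = sum(rank[n] for n in nodes if not neighbors[n])
        updated = {}
        for n in nodes:
            incoming = sum(rank[m] / len(neighbors[m]) for m in neighbors[n])
            updated[n] = (1 - damping) / len(nodes) + damping * (incoming + dangling / len(nodes))
        rank = updated
    return rank

def hybrid_search(query, vectors, edges, k=3):
    """Pick the k most similar nodes, then order them by PageRank in their subgraph."""
    nearest = sorted(vectors, key=lambda n: cosine(query, vectors[n]), reverse=True)[:k]
    scores = pagerank(nearest, edges)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.8, 0.2], "d": [0.0, 1.0]}
edges = [("a", "b"), ("b", "c"), ("c", "d")]
results = hybrid_search([1.0, 0.0], vectors, edges, k=3)
print([node for node, _ in results])
```

Here "a" is the closest vector to the query, but "b" ranks first because it connects the other two retrieved nodes. That is the trade-off of this design: the final order depends on graph structure only, and similarity just decides which nodes are candidates.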

This approach is well-integrated with TxtAI's existing ecosystem, utilizing its Embeddings class for text-to-vector conversion. It also leverages popular and well-maintained libraries like hnswlib for HNSW implementation and NetworkX for graph operations, ensuring compatibility and ease of maintenance.

To use this new feature:

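# dim must match the output size of the embeddings model (e.g. 384 for all-MiniLM-L6-v2, 768 for all-mpnet-base-v2)
graph = HybridGraph(dim=768, max_elements=100000)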
graph.add_node("1", "This is a sample text")
graph.add_node("2", "Another example")
graph.add_edge("1", "2")

results = graph.search("sample query", k=5)

This implementation provides a solid foundation for advanced indexing optimization in TxtAI, combining the speed of HNSW with the structural analysis capabilities of graph algorithms.


nicolas-geysse commented 5 months ago

Proposal for implementing Query Optimization in TxtAI:

Feature: Advanced Query Optimization (Relates to: LLM Integration for Knowledge Graph Enhancement) (New Feature Tag)

  1. Develop a query planner that leverages both graph structure and semantic embeddings:
import networkx as nx
from txtai.embeddings import Embeddings
from txtai.graph import Graph

class SemanticQueryPlanner:
    def __init__(self, graph: Graph, embeddings: Embeddings):
        self.graph = graph
        self.embeddings = embeddings

    def plan_query(self, query: str):
        # Get semantic embedding of the query
        query_embedding = self.embeddings.transform(query)

        # Find semantically similar nodes
        similar_nodes = self.find_similar_nodes(query_embedding)

        # Use NetworkX to find optimal paths in the graph
        subgraph = self.graph.graph.subgraph(similar_nodes)
        paths = nx.all_pairs_shortest_path(subgraph)

        # Combine semantic similarity and graph structure for planning
        plan = self.combine_semantic_and_structure(paths, query_embedding)
        return plan

    def find_similar_nodes(self, query_embedding, top_k=10):
        # Find nodes with similar embeddings
        similar = self.embeddings.search(query_embedding, top_k)
        return [node for node, _ in similar]

    def combine_semantic_and_structure(self, paths, query_embedding):
        # Implement logic to combine path information and semantic similarity
        # This is a placeholder for more sophisticated combination logic
        plan = []
        for start, end_dict in paths:
            for end, path in end_dict.items():
                plan.append((start, end, path))
        return plan
  2. Implement query result caching based on semantic similarity:
import numpy as np

class SemanticCache:
    def __init__(self, embeddings: Embeddings, similarity_threshold=0.9):
        self.embeddings = embeddings
        self.similarity_threshold = similarity_threshold
        self.cache = {}

    def get(self, query: str):
        # No lru_cache here: memoizing a miss would keep returning None
        # even after a matching result is stored with set()
        query_embedding = self.embeddings.transform(query)
        for cached_query, (cached_embedding, result) in self.cache.items():
            # Dot product equals cosine similarity when embeddings are normalized
            similarity = np.dot(query_embedding, cached_embedding)
            if similarity > self.similarity_threshold:
                return result
        return None

    def set(self, query: str, result):
        query_embedding = self.embeddings.transform(query)
        self.cache[query] = (query_embedding, result)
  3. Create a cost-based optimizer for complex graph queries:
class CostBasedOptimizer:
    def __init__(self, graph: Graph):
        self.graph = graph

    def optimize(self, query_plan):
        # Implement cost estimation for different query operations
        estimated_costs = self.estimate_costs(query_plan)

        # Model the plan as a chain of steps in a DAG; with a single chain,
        # dag_longest_path returns every step in its original order
        G = nx.DiGraph()
        for i, step in enumerate(query_plan):
            G.add_node(i, cost=estimated_costs[i])
            if i > 0:
                G.add_edge(i-1, i)

        optimal_path = nx.dag_longest_path(G)
        return [query_plan[i] for i in optimal_path]

    def estimate_costs(self, query_plan):
        # Placeholder for cost estimation logic: each step is a
        # (start, end, path) tuple, so use the path length as its cost
        # This should be replaced with more sophisticated cost models
        return [len(path) for _, _, path in query_plan]
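
Cost-based selection only becomes meaningful once there are alternative plans to choose from. The following is a minimal sketch under that assumption; `path_cost` and `cheapest_plan` are hypothetical helpers, and hop count is used as a stand-in cost model.

```python
def path_cost(step):
    """Cost of one (start, end, path) step: number of hops along the path."""
    _, _, path = step
    return max(len(path) - 1, 0)

def cheapest_plan(candidates):
    """Return the candidate plan with the lowest total estimated cost."""
    return min(candidates, key=lambda plan: sum(path_cost(step) for step in plan))

# Two alternative plans connecting the same pair of nodes
plan_a = [("ai", "healthcare", ["ai", "ml", "diagnostics", "healthcare"])]
plan_b = [("ai", "healthcare", ["ai", "medicine", "healthcare"])]
best = cheapest_plan([plan_a, plan_b])
print(best)
```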

Integration with TxtAI:

This implementation leverages TxtAI's existing Graph and Embeddings classes, ensuring compatibility with the current ecosystem. It also utilizes NetworkX for graph algorithms, which is already used in TxtAI.

Usage example:

graph = Graph()
embeddings = Embeddings()
planner = SemanticQueryPlanner(graph, embeddings)
cache = SemanticCache(embeddings)
optimizer = CostBasedOptimizer(graph)

query = "Find connections between AI and healthcare"

# Check the cache before planning, so cached queries skip planning entirely
if (cached_result := cache.get(query)) is not None:
    print("Using cached result")
    result = cached_result
else:
    initial_plan = planner.plan_query(query)
    optimized_plan = optimizer.optimize(initial_plan)
    result = execute_plan(optimized_plan)  # This function needs to be implemented
    cache.set(query, result)

print(result)

This feature enhances the "LLM Integration for Knowledge Graph Enhancement" part of the roadmap by providing advanced query optimization capabilities. It combines semantic understanding from embeddings with graph structure analysis to create query plans. The semantic cache avoids recomputing results for queries that are close to an earlier one, and the cost-based optimizer is the place to order or prune plan steps by estimated cost; the cost model shown here is only a placeholder.
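
The caching behavior can be checked without a model. The sketch below assumes a toy keyword-count `embed` function and a three-word vocabulary, both invented for the example. It normalizes vectors before comparing them, so the dot product is a cosine similarity, and it returns the best match above the threshold rather than the first one.

```python
import math

class SemanticCache:
    """Returns a stored result when a new query is similar enough to an old one."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # callable: text -> vector (assumption)
        self.threshold = threshold
        self.entries = []           # list of (unit vector, result)

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector] if norm else list(vector)

    def get(self, query):
        q = self._normalize(self.embed(query))
        best, best_score = None, self.threshold
        for vector, result in self.entries:
            score = sum(x * y for x, y in zip(q, vector))
            if score > best_score:
                best, best_score = result, score
        return best

    def set(self, query, result):
        self.entries.append((self._normalize(self.embed(query)), result))

VOCAB = ["ai", "healthcare", "finance"]

def embed(text):
    """Toy embedding: how often each vocabulary word occurs in the text."""
    words = text.lower().split()
    return [words.count(term) for term in VOCAB]

cache = SemanticCache(embed)
query = "Find connections between AI and healthcare"
miss = cache.get(query)                   # nothing stored yet
cache.set(query, "cached answer")
hit = cache.get("AI in healthcare")       # same keywords, similarity close to 1
other = cache.get("AI in finance")        # similarity about 0.5, below threshold
print(miss, hit, other)
```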

The implementation is designed to be simple and well-integrated with TxtAI's existing components, using NetworkX for graph algorithms and building upon TxtAI's Graph and Embeddings classes. This approach ensures that the new feature fits seamlessly into the TxtAI ecosystem while providing powerful query optimization capabilities.
