run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: EmptyNetworkError #15251

Open 18811449050 opened 3 months ago

18811449050 commented 3 months ago

Bug Description

Note: There are more relationships that can be extracted from the text, but I have only provided two entity-relation triplets as per your request.
Extracting paths from text: 100%|████████████████████| 10/10 [01:39<00:00, 9.92s/it]
Generating embeddings: 100%|████████████████████| 1/1 [00:00<00:00, 1.07it/s]
Generating embeddings: 0it [00:00, ?it/s]
Received notification from DBMS server: {severity: WARNING} {code: Neo.ClientNotification.Statement.FeatureDeprecationWarning} {category: DEPRECATION} {title: This feature is deprecated and will be removed in future versions.} {description: The procedure has a deprecated field. ('config' used by 'apoc.meta.graphSample' is deprecated.)} {position: line: 1, column: 1, offset: 0} for query: "CALL apoc.meta.graphSample() YIELD nodes, relationships RETURN nodes, [rel in relationships | {name:apoc.any.property(rel, 'type'), count: apoc.any.property(rel, 'count')}] AS relationships"
index.property_graph_store: <__main__.GraphRAGStore object at 0x7f6b54528df0>
Received notification from DBMS server: {severity: WARNING} {code: Neo.ClientNotification.Statement.UnknownPropertyKeyWarning} {category: UNRECOGNIZED} {title: The provided property key is not in the database} {description: One of the property names in your query is not available in the database, make sure you didn't misspell it or that the label is available when you run this statement in your application (the missing property name is: name)} {position: line: 6, column: 22, offset: 138} for query: "MATCH (e:__Entity__) \n WITH e\n CALL {\n WITH e\n MATCH (e)-[r]->(t:__Entity__)\n RETURN e.name AS source_id, [l in labels(e) WHERE NOT l IN ['__Entity__', '__Node__'] | l][0] AS source_type,\n e{.* , embedding: Null, name: Null} AS source_properties,\n type(r) AS type,\n r{.*} AS rel_properties,\n t.name AS target_id, [l in labels(t) WHERE NOT l IN ['__Entity__', '__Node__'] | l][0] AS target_type,\n t{.* , embedding: Null, name: Null} AS target_properties\n UNION ALL\n WITH e\n MATCH (e)<-[r]-(t:__Entity__)\n RETURN t.name AS source_id, [l in labels(t) WHERE NOT l IN ['__Entity__', '__Node__'] | l][0] AS source_type,\n t{.* , embedding: Null, name: Null} AS source_properties,\n type(r) AS type,\n r{.*} AS rel_properties,\n e.name AS target_id, [l in labels(e) WHERE NOT l IN ['__Entity__', '__Node__'] | l][0] AS target_type,\n e{.* , embedding: Null, name: Null} AS target_properties\n }\n RETURN source_id, source_type, type, rel_properties, target_id, target_type, source_properties, target_properties"
(The same UnknownPropertyKeyWarning for the same query is reported three more times, at {position: line: 10, column: 22, offset: 417}, {position: line: 15, column: 22, offset: 700}, and {position: line: 19, column: 22, offset: 979}.)
Generated triplets: []
(The same four UnknownPropertyKeyWarning notifications, at the same four positions, are then logged a second time.)
Traceback (most recent call last):
  File "/data1/mgl/llama_index_security_project/linshi_v1.py", line 451, in <module>
    index.property_graph_store.build_communities()
  File "/data1/mgl/llama_index_security_project/linshi_v1.py", line 190, in build_communities
    community_hierarchical_clusters = hierarchical_leiden(
  File "<@beartype(graspologic.partition.leiden.hierarchical_leiden) at 0x7f6b5879b940>", line 304, in hierarchical_leiden
  File "/root/miniconda3/envs/llama_index_env1/lib/python3.9/site-packages/graspologic/partition/leiden.py", line 588, in hierarchical_leiden
    hierarchical_clusters_native = gn.hierarchical_leiden(
leiden.EmptyNetworkError: EmptyNetworkError

Version

stable

Steps to Reproduce

I checked the intermediate output. The response_str passed to parse_fn contains entities and relationships, but index.property_graph_store.get_triplets() returns an empty list. Why?
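
One way to narrow this down is to run the two regular expressions on a captured response_str outside the pipeline. The sketch below assumes the patterns expect quoted fields separated by a literal `$$$$` and wrapped in parentheses, and it uses made-up sample strings rather than real model output. If the model's actual response uses a different layout, both lists come back empty and nothing is written to the graph store, even though the text visibly mentions entities and relationships.

```python
import re

# Assumed patterns: quoted fields, a literal "$$$$" delimiter, and one
# parenthesized record per entity or relationship.
ENTITY_PATTERN = r'\("entity"\$\$\$\$"(.+?)"\$\$\$\$"(.+?)"\$\$\$\$"(.+?)"\)'
RELATIONSHIP_PATTERN = (
    r'\("relationship"\$\$\$\$"(.+?)"\$\$\$\$"(.+?)"\$\$\$\$"(.+?)"\$\$\$\$"(.+?)"\)'
)


def parse_response(response_str):
    """Return (entities, relationships) found in one LLM response."""
    entities = re.findall(ENTITY_PATTERN, response_str)
    relationships = re.findall(RELATIONSHIP_PATTERN, response_str)
    return entities, relationships


# Hypothetical response that follows the expected format.
well_formed = (
    '("entity"$$$$"ACME"$$$$"Company"$$$$"A hypothetical company")\n'
    '("entity"$$$$"BOB"$$$$"Person"$$$$"A hypothetical employee")\n'
    '("relationship"$$$$"BOB"$$$$"ACME"$$$$"WORKS_AT"$$$$"Bob works at Acme")'
)

# Hypothetical response that names the same facts in a different layout.
off_format = "1. ACME (Company): A hypothetical company\n2. BOB -> WORKS_AT -> ACME"

good_entities, good_relationships = parse_response(well_formed)
bad_entities, bad_relationships = parse_response(off_format)

print("well-formed:", len(good_entities), "entities,", len(good_relationships), "relationships")
print("off-format: ", len(bad_entities), "entities,", len(bad_relationships), "relationships")
```

Pasting a real response_str in place of the samples shows directly whether the extractor has anything to store.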

Relevant Logs/Tracebacks

import pandas as pd
from llama_index.core import Document
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore
import asyncio
import nest_asyncio

from typing import Any, List, Callable, Optional, Union, Dict
# from IPython.display import Markdown, display

import re
import networkx as nx
from graspologic.partition import hierarchical_leiden
from collections import defaultdict

from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core.query_engine import CustomQueryEngine
from llama_index.core.llms import LLM
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Settings, PropertyGraphIndex
from llama_index.core.llms import ChatMessage
from llama_index.core.async_utils import run_jobs
from llama_index.core.indices.property_graph.utils import (
    default_parse_triplets_fn,
)
from llama_index.core.graph_stores.types import (
    EntityNode,
    KG_NODES_KEY,
    KG_RELATIONS_KEY,
    Relation,
)
from llama_index.core.prompts import PromptTemplate
from llama_index.core.prompts.default_prompts import (DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,)
from llama_index.core.schema import TransformComponent, BaseNode
from llama_index.core.bridge.pydantic import BaseModel, Field

class GraphRAGExtractor(TransformComponent):
    """Extract triples from a graph.

    Uses an LLM and a simple prompt + output parsing to extract paths (i.e. triples) and entity, relation descriptions from text.

    Args:
        llm (LLM):
            The language model to use.
        extract_prompt (Union[str, PromptTemplate]):
            The prompt to use for extracting triples.
        parse_fn (callable):
            A function to parse the output of the language model.
        num_workers (int):
            The number of workers to use for parallel processing.
        max_paths_per_chunk (int):
            The maximum number of paths to extract per chunk.
    """

    llm: LLM
    extract_prompt: PromptTemplate
    parse_fn: Callable
    num_workers: int
    max_paths_per_chunk: int

    def __init__(
        self,
        llm: Optional[LLM] = None,
        extract_prompt: Optional[Union[str, PromptTemplate]] = None,
        parse_fn: Callable = default_parse_triplets_fn,
        max_paths_per_chunk: int = 10,
        num_workers: int = 4,
    ) -> None:
        """Init params."""
        from llama_index.core import Settings

        if isinstance(extract_prompt, str):
            extract_prompt = PromptTemplate(extract_prompt)

        super().__init__(
            llm=llm or Settings.llm,
            extract_prompt=extract_prompt or DEFAULT_KG_TRIPLET_EXTRACT_PROMPT,
            parse_fn=parse_fn,
            num_workers=num_workers,
            max_paths_per_chunk=max_paths_per_chunk,
        )

    @classmethod
    def class_name(cls) -> str:
        return "GraphExtractor"

    def __call__(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes."""
        return asyncio.run(
            self.acall(nodes, show_progress=show_progress, **kwargs)
        )

    async def _aextract(self, node: BaseNode) -> BaseNode:
    # def _aextract(self, node: BaseNode) -> BaseNode:
        """Extract triples from a node."""
        assert hasattr(node, "text")

        text = node.get_content(metadata_mode="llm")
        try:
            llm_response = await self.llm.apredict(
            # llm_response = self.llm.apredict(
                self.extract_prompt,
                text=text,
                max_knowledge_triplets=self.max_paths_per_chunk,
            )
            entities, entities_relationship = self.parse_fn(llm_response)
        except ValueError:
            entities = []
            entities_relationship = []

        existing_nodes = node.metadata.pop(KG_NODES_KEY, [])
        existing_relations = node.metadata.pop(KG_RELATIONS_KEY, [])
        entity_metadata = node.metadata.copy()
        for entity, entity_type, description in entities:
            entity_metadata["entity_description"] = description
            entity_node = EntityNode(
                name=entity, label=entity_type, properties=entity_metadata
            )
            existing_nodes.append(entity_node)

        relation_metadata = node.metadata.copy()
        for triple in entities_relationship:
            subj, obj, rel, description = triple
            relation_metadata["relationship_description"] = description
            rel_node = Relation(
                label=rel,
                source_id=subj,
                target_id=obj,
                properties=relation_metadata,
            )

            existing_relations.append(rel_node)

        node.metadata[KG_NODES_KEY] = existing_nodes
        node.metadata[KG_RELATIONS_KEY] = existing_relations
        return node

    async def acall(
    # def acall(
        self, nodes: List[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        """Extract triples from nodes async."""
        jobs = []
        for node in nodes:
            jobs.append(self._aextract(node))

        return await run_jobs(
        # return run_jobs(
            jobs,
            workers=self.num_workers,
            show_progress=show_progress,
            desc="Extracting paths from text",
        )

class GraphRAGStore(Neo4jPropertyGraphStore):
    community_summary = {}
    entity_info = None
    max_cluster_size = 5

    def generate_community_summary(self, text):
        """Generate summary for a given text using an LLM."""
        messages = [
            ChatMessage(
                role="system",
                content=(
                    "You are provided with a set of relationships from a knowledge graph, each represented as "
                    "entity1->entity2->relation->relationship_description. Your task is to create a summary of these "
                    "relationships. The summary should include the names of the entities involved and a concise synthesis "
                    "of the relationship descriptions. The goal is to capture the most critical and relevant details that "
                    "highlight the nature and significance of each relationship. Ensure that the summary is coherent and "
                    "integrates the information in a way that emphasizes the key aspects of the relationships."
                ),
            ),
            ChatMessage(role="user", content=text),
        ]
        # OpenAI is never imported in this script; use the module-level Ollama LLM instead.
        response = llm.chat(messages)
        print("response:", response)
        clean_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return clean_response

    def build_communities(self):
        """Builds communities from the graph and summarizes them."""
        nx_graph = self._create_nx_graph()
        community_hierarchical_clusters = hierarchical_leiden(
            nx_graph, max_cluster_size=self.max_cluster_size
        )
        self.entity_info, community_info = self._collect_community_info(
            nx_graph, community_hierarchical_clusters
        )
        self._summarize_communities(community_info)

    def _create_nx_graph(self):
        """Converts internal graph representation to NetworkX graph."""
        nx_graph = nx.Graph()
        triplets = self.get_triplets()
        for entity1, relation, entity2 in triplets:
            nx_graph.add_node(entity1.name)
            nx_graph.add_node(entity2.name)
            nx_graph.add_edge(
                relation.source_id,
                relation.target_id,
                relationship=relation.label,
                description=relation.properties["relationship_description"],
            )
        return nx_graph

    def _collect_community_info(self, nx_graph, clusters):
        """
        Collect information for each node based on their community,
        allowing entities to belong to multiple clusters.
        """
        entity_info = defaultdict(set)
        community_info = defaultdict(list)

        for item in clusters:
            node = item.node
            cluster_id = item.cluster

            # Update entity_info
            entity_info[node].add(cluster_id)

            for neighbor in nx_graph.neighbors(node):
                edge_data = nx_graph.get_edge_data(node, neighbor)
                if edge_data:
                    detail = f"{node} -> {neighbor} -> {edge_data['relationship']} -> {edge_data['description']}"
                    community_info[cluster_id].append(detail)

        # Convert sets to lists for easier serialization if needed
        entity_info = {k: list(v) for k, v in entity_info.items()}

        return dict(entity_info), dict(community_info)

    def _summarize_communities(self, community_info):
        """Generate and store summaries for each community."""
        for community_id, details in community_info.items():
            details_text = (
                "\n".join(details) + "."
            )  # Ensure it ends with a period
            self.community_summary[
                community_id
            ] = self.generate_community_summary(details_text)

    def get_community_summaries(self):
        """Returns the community summaries, building them if not already done."""
        if not self.community_summary:
            self.build_communities()
        return self.community_summary

class GraphRAGQueryEngine(CustomQueryEngine):
    graph_store: GraphRAGStore
    index: PropertyGraphIndex
    llm: LLM
    similarity_top_k: int = 20

    def custom_query(self, query_str: str) -> str:
        """Process all community summaries to generate answers to a specific query."""

        entities = self.get_entities(query_str, self.similarity_top_k)

        community_ids = self.retrieve_entity_communities(
            self.graph_store.entity_info, entities
        )
        community_summaries = self.graph_store.get_community_summaries()
        community_answers = [
            self.generate_answer_from_summary(community_summary, query_str)
            for id, community_summary in community_summaries.items()
            if id in community_ids
        ]

        final_answer = self.aggregate_answers(community_answers)
        return final_answer

    def get_entities(self, query_str, similarity_top_k):
        # Use the index stored on the query engine rather than a module-level global.
        nodes_retrieved = self.index.as_retriever(
            similarity_top_k=similarity_top_k
        ).retrieve(query_str)

        entities = set()
        pattern = r"(\w+(?:\s+\w+)*)\s*{[^}]*}{[^}]*}{[^}]*}\s*->\s*([^(]+?)\s*{[^}]*}{[^}]*}{[^}]*}\s*->\s*(\w+(?:\s+\w+)*)"

        for node in nodes_retrieved:
            matches = re.findall(pattern, node.text, re.DOTALL)

            for match in matches:
                subject = match[0]
                obj = match[2]
                entities.add(subject)
                entities.add(obj)

        return list(entities)

    def retrieve_entity_communities(self, entity_info, entities):
        """
        Retrieve cluster information for given entities, allowing for multiple clusters per entity.

        Args:
        entity_info (dict): Dictionary mapping entities to their cluster IDs (list).
        entities (list): List of entity names to retrieve information for.

        Returns:
        List of community or cluster IDs to which an entity belongs.
        """
        community_ids = []

        for entity in entities:
            if entity in entity_info:
                community_ids.extend(entity_info[entity])

        return list(set(community_ids))

    def generate_answer_from_summary(self, community_summary, query):
        """Generate an answer from a community summary based on a given query using LLM."""
        prompt = (
            f"Given the community summary: {community_summary}, "
            f"how would you answer the following query? Query: {query}"
        )
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content="I need an answer based on the above information.",
            ),
        ]
        response = self.llm.chat(messages)
        cleaned_response = re.sub(r"^assistant:\s*", "", str(response)).strip()
        return cleaned_response

    def aggregate_answers(self, community_answers):
        """Aggregate individual community answers into a final, coherent response."""
        # intermediate_text = " ".join(community_answers)
        prompt = "Combine the following intermediate answers into a final, concise response."
        messages = [
            ChatMessage(role="system", content=prompt),
            ChatMessage(
                role="user",
                content=f"Intermediate answers: {community_answers}",
            ),
        ]
        final_response = self.llm.chat(messages)
        cleaned_final_response = re.sub(
            r"^assistant:\s*", "", str(final_response)
        ).strip()
        return cleaned_final_response

def parse_fn(response_str: str) -> Any:
    print("response_str:", response_str)
    entities = re.findall(entity_pattern, response_str)
    relationships = re.findall(relationship_pattern, response_str)
    return entities, relationships

# nest_asyncio.apply()

llm = Ollama(
    model="llama3:70b",
    request_timeout=1200.0,
    context_window=3900,
    json_mode=False,
    temperature=0.1,
    # num_output=256
)

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
Settings.embed_model = HuggingFaceEmbedding(model_name="/data1/mgl/model/bge-large-zh-v1.5-model/")
# Settings.embed_model = OllamaEmbedding(
#         model_name="quentinz/bge-large-zh-v1.5:latest",
#         base_url="http://localhost:11434",
#         ollama_additional_kwargs={"mirostat": 0},
#     )

news = pd.read_csv("./news_articles.csv")[:10]
print("news:", news)
print("Raw data preview:", news["text"].tolist())
documents = [
    Document(text=f"{row['title']}: {row['text']}")
    for i, row in news.iterrows()
]

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)

# Note: used to be `Neo4jPGStore`
graph_store = GraphRAGStore(username="neo4j", password="neo4j", url="bolt://0.0.0.0:7687")

KG_TRIPLET_EXTRACT_TMPL = """
-Goal-
Given a text document, identify all entities and their entity types from the text and all relationships among the identified entities.
Given the text, extract up to {max_knowledge_triplets} entity-relation triplets.

-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: Type of the entity
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"$$$$"<entity_name>"$$$$"<entity_type>"$$$$"<entity_description>")

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relation: relationship between source_entity and target_entity
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other

Format each relationship as ("relationship"$$$$"<source_entity>"$$$$"<target_entity>"$$$$"<relation>"$$$$"<relationship_description>")

3. When finished, output.

-Real Data-
######################
text: {text}
######################
output:"""

entity_pattern = r'\("entity"\$\$\$\$"(.+?)"\$\$\$\$"(.+?)"\$\$\$\$"(.+?)"\)'
relationship_pattern = r'\("relationship"\$\$\$\$"(.+?)"\$\$\$\$"(.+?)"\$\$\$\$"(.+?)"\$\$\$\$"(.+?)"\)'

kg_extractor = GraphRAGExtractor(
    llm=llm,
    extract_prompt=KG_TRIPLET_EXTRACT_TMPL,
    max_paths_per_chunk=2,
    parse_fn=parse_fn,
)

index = PropertyGraphIndex(
    nodes=nodes,
    kg_extractors=[kg_extractor],
    property_graph_store=graph_store,
    show_progress=True,
    embed_model=Settings.embed_model
)

print("index.property_graph_store:", index.property_graph_store)
print("Generated triplets:", index.property_graph_store.get_triplets())
# print("index.property_graph_store.get_triplets()[10][0].properties:", index.property_graph_store.get_triplets()[10][0].properties)
# print("index.property_graph_store.get_triplets()[10][1].properties:", index.property_graph_store.get_triplets()[10][1].properties)

index.property_graph_store.build_communities()

18811449050 commented 3 months ago

Why does build_communities() report such an error?

dosubot[bot] commented 3 months ago

The build_communities() method in the GraphRAGStore class is reporting an EmptyNetworkError because the get_triplets() method is returning an empty list. This indicates that no triplets are being retrieved from the graph store, leading to an empty NetworkX graph, which in turn causes the error when attempting to build communities.
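
A small guard in front of the clustering step would make this failure easier to read. The sketch below is a stand-in that uses plain tuples instead of the store's node and relation objects; `edges_from_triplets` and the tuple layout are hypothetical. The idea is only to fail with an explicit message before hierarchical_leiden is ever called on an empty graph.

```python
from typing import Dict, List, Tuple

# Simplified stand-in for a triplet: (source name, relation label, target name).
Triplet = Tuple[str, str, str]
Edge = Tuple[str, str, Dict[str, str]]


def edges_from_triplets(triplets: List[Triplet]) -> List[Edge]:
    """Convert triplets to (u, v, attributes) edges, refusing an empty input."""
    if not triplets:
        raise ValueError(
            "No triplets were returned by the graph store; "
            "check that extraction produced entities and relations "
            "before building communities."
        )
    return [
        (source, target, {"relationship": label})
        for source, label, target in triplets
    ]


# Hypothetical data for illustration.
edges = edges_from_triplets([("BOB", "WORKS_AT", "ACME")])
print(edges)

try:
    edges_from_triplets([])
except ValueError as exc:
    print("guard triggered:", exc)
```

The returned list has the (u, v, attribute dict) shape that networkx's `Graph.add_edges_from` accepts, so the same check could sit at the top of `_create_nx_graph`.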

Here are some potential reasons why get_triplets() might be returning an empty list:

  1. No Parameters Provided: Some graph store implementations return an empty list from get_triplets() when none of the parameters (entity_names, relation_names, properties, ids) is provided, through an initial check like the one below. In this issue, however, the Neo4j warnings in the log show that the Cypher query was actually sent to the database, so this early return is unlikely to be the cause:

    if not ids and not properties and not entity_names and not relation_names:
        return []

  2. No Matching Records: Even if parameters are provided, the method can still return an empty list when they do not match any records in the database. For example, if the entity_names, relation_names, properties, or ids do not correspond to any existing nodes or relationships, the query returns no results.

  3. Database Issues: The warnings from the Neo4j DBMS server indicate that some property keys are not recognized. This could mean that the properties being queried do not exist in the database, leading to no results being returned.

To address this issue, you should:

  1. Verify Parameters: Ensure that the parameters being passed to get_triplets() are correct and match the data in your database.

  2. Check Database Schema: Verify that the properties and labels being queried exist in the database. The warnings about unknown property keys suggest that there might be a mismatch between the expected schema and the actual data.

  3. Debugging: Add logging or print statements to check the intermediate outputs and ensure that the data is being processed correctly. For example, you can print the parameters being passed to get_triplets() and the results of the database queries.

By addressing these points, you should be able to identify why get_triplets() is returning an empty list and resolve the EmptyNetworkError in the build_communities() method [1][2][3][4].
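
For the debugging step, one option is to count how many entities and relations the extractor attached to each chunk before anything is written to Neo4j. The sketch below works on plain metadata dicts. The keys "nodes" and "relations" are assumed to mirror KG_NODES_KEY and KG_RELATIONS_KEY, and `summarize_extraction` is a hypothetical helper, not part of LlamaIndex.

```python
from typing import Any, Dict, List

NODES_KEY = "nodes"          # assumed to match KG_NODES_KEY
RELATIONS_KEY = "relations"  # assumed to match KG_RELATIONS_KEY


def summarize_extraction(chunk_metadata: List[Dict[str, Any]]) -> List[Dict[str, int]]:
    """Count extracted entities and relations for each chunk's metadata."""
    report = []
    for position, metadata in enumerate(chunk_metadata):
        report.append(
            {
                "chunk": position,
                "entities": len(metadata.get(NODES_KEY, [])),
                "relations": len(metadata.get(RELATIONS_KEY, [])),
            }
        )
    return report


# Hypothetical metadata: the first chunk has extracted items, the second has none.
sample = [
    {"nodes": ["ACME", "BOB"], "relations": [("BOB", "WORKS_AT", "ACME")]},
    {},
]

report = summarize_extraction(sample)
empty_chunks = [row["chunk"] for row in report if row["entities"] == 0]

for row in report:
    print(row)
print("chunks with no entities:", empty_chunks)
```

If every chunk reports zero entities, the problem lies in parsing the LLM response rather than in the graph store or the clustering call.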


qizhanghw commented 3 months ago

I have the same problem. Have you solved it?