Open 912100012 opened 2 months ago
I found a similar issue that might help: [Question]: How does GraphIndex retrieve nodes? Does it rely on LLM?. The issue is closed and provides some insights into node retrieval mechanisms.

Regarding your specific question about the `retrieve` method in the GraphRAG implementation with LlamaIndex - V2: the method matches the regular expression pattern using the `re` module in Python. The `retrieve` method returns semantically relevant nodes, which are then processed to match the provided regular expression pattern.
Here is the relevant code snippet:

```python
nodes_retrieved = self.index.as_retriever(
    similarity_top_k=similarity_top_k
).retrieve(query_str)

pattern = r"(\w+(?:\s+\w+)*)\s*\({[^}]*}\)\s*->\s*([^(]+?)\s*\({[^}]*}\)\s*->\s*(\w+(?:\s+\w+)*)"
```
The regular expression `r"(\w+(?:\s+\w+)*)\s*\({[^}]*}\)\s*->\s*([^(]+?)\s*\({[^}]*}\)\s*->\s*(\w+(?:\s+\w+)*)"` is designed to match a specific pattern in the text of the retrieved nodes: sequences of words followed by specific characters and relationships. The reason it might always result in an empty set could be that the response from the LLM does not contain the expected format. If the text in the retrieved nodes does not conform to the pattern specified by the regular expression, `re.findall` will return an empty list.

To ensure that the pattern matches, you might need to verify the format of the text in the retrieved nodes and adjust the regular expression accordingly. If the text format is different, the regular expression needs to be updated to match the actual format of the text.
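As a quick way to check this, you can run the pattern against a hand-written string in the triplet shape the notebook expects. The sample strings below are made up purely for illustration:

```python
import re

# The triplet pattern from the notebook
pattern = r"(\w+(?:\s+\w+)*)\s*\({[^}]*}\)\s*->\s*([^(]+?)\s*\({[^}]*}\)\s*->\s*(\w+(?:\s+\w+)*)"

# A made-up string in the format the pattern expects:
# "entity ({props}) -> relation ({props}) -> entity"
good = "Alice Smith ({a person}) -> works at ({an employment relation}) -> Acme Corp"
print(re.findall(pattern, good))   # [('Alice Smith', 'works at', 'Acme Corp')]

# Plain chunk text does not match, which is why the result can be []
bad = "Alice Smith works at Acme Corp"
print(re.findall(pattern, bad))    # []
```

If your retrieved node text looks like the `bad` string, the pattern will never fire and the result stays empty.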
For more details, you can refer to the GraphRAG_v2.ipynb notebook [1].
To continue talking to Dosu, mention @dosu.
What is the significance of the differences between the two versions? @dosu
```python
def custom_query(self, query_str: str) -> str:
    """Process all community summaries to generate answers to a specific query."""
    community_summaries = self.graph_store.get_community_summaries()
    community_answers = [
        self.generate_answer_from_summary(community_summary, query_str)
        for _, community_summary in community_summaries.items()
    ]
    final_answer = self.aggregate_answers(community_answers)
    return final_answer
```

```python
def custom_query(self, query_str: str) -> str:
    """Process all community summaries to generate answers to a specific query."""
    entities = self.get_entities(query_str, self.similarity_top_k)
    community_ids = self.retrieve_entity_communities(
        self.graph_store.entity_info, entities
    )
    community_summaries = self.graph_store.get_community_summaries()
    community_answers = [
        self.generate_answer_from_summary(community_summary, query_str)
        for id, community_summary in community_summaries.items()
        if id in community_ids
    ]
    final_answer = self.aggregate_answers(community_answers)
    return final_answer
```
The significance of the differences between the two versions of the `custom_query` method lies in their approach to processing community summaries and generating answers.

Version 1 processes every community:

```python
def custom_query(self, query_str: str) -> str:
    """Process all community summaries to generate answers to a specific query."""
    community_summaries = self.graph_store.get_community_summaries()
    community_answers = [
        self.generate_answer_from_summary(community_summary, query_str)
        for _, community_summary in community_summaries.items()
    ]
    final_answer = self.aggregate_answers(community_answers)
    return final_answer
```

- Retrieves all community summaries from the `graph_store`.
- Generates an answer for every summary with the `generate_answer_from_summary` method.
- Combines all answers into a final answer with the `aggregate_answers` method.

Version 2 only processes communities relevant to the query:

```python
def custom_query(self, query_str: str) -> str:
    """Process all community summaries to generate answers to a specific query."""
    entities = self.get_entities(query_str, self.similarity_top_k)
    community_ids = self.retrieve_entity_communities(
        self.graph_store.entity_info, entities
    )
    community_summaries = self.graph_store.get_community_summaries()
    community_answers = [
        self.generate_answer_from_summary(community_summary, query_str)
        for id, community_summary in community_summaries.items()
        if id in community_ids
    ]
    final_answer = self.aggregate_answers(community_answers)
    return final_answer
```

- Extracts entities relevant to the query with the `get_entities` method.
- Maps those entities to their community ids with the `retrieve_entity_communities` method.
- Retrieves the community summaries from the `graph_store`, but generates answers only for the communities that contain the extracted entities.
- Combines the filtered answers into a final answer with the `aggregate_answers` method.

In summary, the V2 version enhances the efficiency and relevance of the query processing by introducing entity extraction and community filtering steps, ensuring that only pertinent community summaries are processed [1][2][3].
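The filtering step V2 adds can be illustrated with a few made-up summaries and community ids:

```python
# Made-up community summaries and the ids of communities that contain
# the entities extracted from the query
community_summaries = {0: "summary A", 1: "summary B", 2: "summary C"}
community_ids = [1]

# V2 only generates answers for the communities in community_ids,
# so only one LLM call would be made here instead of three
relevant = {cid: s for cid, s in community_summaries.items() if cid in community_ids}
print(relevant)  # {1: 'summary B'}
```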
To continue talking to Dosu, mention @dosu.
I have the same issue. Nothing matches the regex pattern, even though I am using the same notebook code with the same configuration and openai version. Any help would be greatly appreciated.
@Qutubullah1 I suspect the author may have forgotten to handle some formats, so I changed the logic of get_entities(). The returned entity list is all the children of the retrieved chunk; I then match the returned entities against the entities in the community entity_info. If they match, I get the corresponding community and query it; otherwise I skip it and do nothing. The results may be somewhat unsatisfactory, and this is just an idea, but you can give it a try. The code follows:
```python
def get_entities(self, query_str, similarity_top_k):
    nodes_retrieved = self.index.as_retriever(
        similarity_top_k=similarity_top_k
    ).retrieve(query_str)
    chunks_retrieved_ids = [node.node_id for node in nodes_retrieved]
    entities_set = set()
    for chunk_id in chunks_retrieved_ids:
        # Cypher query: return all entities of a specific chunk id
        cypher_query = """
        MATCH (chunk:Chunk {id: $chunk_id})-[:MENTIONS]->(entity:__Entity__)
        RETURN entity.name as name;
        """
        params = {"chunk_id": chunk_id}
        entities_nodes = self.index.property_graph_store.structured_query(
            cypher_query, params
        )
        entities_set.update([node['name'] for node in entities_nodes])
    return list(entities_set)
```
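To see how the loop deduplicates entity names across chunks, here is the same set-update logic run on mock `structured_query` results; the data is invented and no Neo4j connection is needed:

```python
# Invented stand-in for the per-chunk results of structured_query()
mock_results = [
    [{"name": "Alice"}, {"name": "Bob"}],    # entities mentioned by chunk 1
    [{"name": "Bob"}, {"name": "Carol"}],    # entities mentioned by chunk 2
]

entities_set = set()
for entities_nodes in mock_results:
    # same update as in get_entities: duplicates across chunks collapse
    entities_set.update(node["name"] for node in entities_nodes)

print(sorted(entities_set))  # ['Alice', 'Bob', 'Carol']
```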
@912100012 this is a good hint to make it work!
Curious, did you guys run into any issues with relationship/entity parsing? It does not work for me and I had to write my own, like:
```python
from typing import List

from openai import OpenAI as OAI  # assumed import; the original snippet omitted it
from pydantic import BaseModel


class Entity(BaseModel):
    entity_id_name: str  # cannot be null; if null, use the summary of the main body
    entity_summary: str
    entity_type: str
    entity_description: str


class Relationship(BaseModel):
    source_entity: Entity  # starting node of a graph relationship
    target_entity: Entity  # ending node of a graph relationship
    relationship: str  # relationship between source and target entities
    relationship_description: str  # description of the relationship


# Metadata model for queries
class GraphRelationship(BaseModel):
    all_entities: List[Entity]
    relationships: List[Relationship]
    graph_summary: str
    top_keywords: List[str]


def relationship_parsing(text: str) -> GraphRelationship:
    """Extract relationships from a graph."""
    # use the LLM's structured-output API to parse text into the schema
    client = OAI(api_key=get_openai_api_key())
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": KG_TRIPLET_EXTRACT_TMPL},
            {"role": "user", "content": text},
        ],
        response_format=GraphRelationship,
    )
    graph_relationships = completion.choices[0].message.parsed
    return graph_relationships
```
Ok, the fix is `pip install llama-index-graph-stores-neo4j==0.3.2`
@Qutubullah1 @hypernovas I tried it again and take back what I said before. The reason the match is an empty [] is the first step, "query keyword extraction". The keywords extracted from your query string, e.g. "keyword1" and "keyword2", are matched against the entities in your constructed graph. If there are no entities called "keyword1" or "keyword2", the retrieval returns nothing and the match is empty. In my case this happened because the documents I used are Chinese, so the default prompt did not fit. As a workaround I added a parse_fn that appends a keyword guaranteed to match an entity in the graph, and I will customize the prompt to fit my pipeline.
```python
from typing import Any

from llama_index.core.indices.property_graph import (
    LLMSynonymRetriever,
    VectorContextRetriever,
)


def output_parsing_fn(response) -> Any:
    if isinstance(response, str):
        join_res = response + "^keyword"
    else:
        raise Exception("output_parsing_fn error")
    matches = join_res.strip().split("^")
    return matches


def get_entities(self, query_str, similarity_top_k):
    graph_retrieve = LLMSynonymRetriever(
        graph_store=self.index.property_graph_store,
        include_text=False,
        output_parsing_fn=output_parsing_fn,
    )
    vector_retrieve = VectorContextRetriever(
        graph_store=self.index.property_graph_store,
        vector_store=self.index.vector_store,
    )
    nodes_retrieved = self.index.as_retriever(
        similarity_top_k=similarity_top_k,
        sub_retrievers=[graph_retrieve, vector_retrieve],
        include_text=False,
        max_keywords=25,
    ).retrieve(query_str)
```
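To see exactly what the workaround does, you can run `output_parsing_fn` on a sample response string; the appended `^keyword` guarantees the split result is never empty:

```python
from typing import Any


def output_parsing_fn(response) -> Any:
    # Append a fallback token so splitting never yields an empty list
    if isinstance(response, str):
        join_res = response + "^keyword"
    else:
        raise Exception("output_parsing_fn error")
    return join_res.strip().split("^")


print(output_parsing_fn("apple^banana"))  # ['apple', 'banana', 'keyword']
```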
If you want to make the matching line work, the query keyword extraction is important!
I also use Chinese documents and ran into this problem. Solved! Thank you very much!
@likewindd You're welcome, man.
New findings!

If the keywords the LLM generates from the given query are not among the entities the LLM generated from the previous documents, it still cannot find the corresponding entities, which in turn results in an empty answer.

If you use Chinese documents, queries, language models, etc., try editing the prompts for better customization: edit the extractor prompt so the extracted entities are better, and edit the prompt in class LLMSynonymRetriever (in llm_synonym.py) in Chinese, since we need the keywords extracted from the query to be better.
```python
DEFAULT_SYNONYM_EXPAND_TEMPLATE = (
    "Given some initial query, generate synonyms or related keywords up to {max_keywords} in total, "
    "considering possible cases of capitalization, pluralization, common expressions, etc.\n"
    "Provide all synonyms/keywords separated by '^' symbols: 'keyword1^keyword2^...'\n"
    "Note, result should be in one-line, separated by '^' symbols."
    "----\n"
    "QUERY: {query_str}\n"
    "----\n"
    "KEYWORDS: "
)
```
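A minimal sketch of how that template is filled in at query time with plain string formatting; note the exact constructor parameter for passing a custom template to LLMSynonymRetriever may vary by version, so check its API:

```python
# The default template text, as quoted in the snippet above
DEFAULT_SYNONYM_EXPAND_TEMPLATE = (
    "Given some initial query, generate synonyms or related keywords up to {max_keywords} in total, "
    "considering possible cases of capitalization, pluralization, common expressions, etc.\n"
    "Provide all synonyms/keywords separated by '^' symbols: 'keyword1^keyword2^...'\n"
    "Note, result should be in one-line, separated by '^' symbols."
    "----\n"
    "QUERY: {query_str}\n"
    "----\n"
    "KEYWORDS: "
)

# Fill in the placeholders the way the retriever would
prompt = DEFAULT_SYNONYM_EXPAND_TEMPLATE.format(max_keywords=5, query_str="who is Alice?")
print(prompt.endswith("KEYWORDS: "))  # True
```

Rewriting this template text in Chinese is the customization suggested above.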
Has anyone implemented this with a domestic API such as Qwen or Zhipu? When I use them, the extraction returns an empty set, even though the documents I use are llamaindex's own documentation.
Hello, where should this piece of code go? Hoping for your answer!!
@Buzeg Sometimes, with different model choices and languages, the official demo may not work either. I suggest stepping through with a debugger and enabling logging. As for where this code goes, I cannot quite follow your question; I suggest following this document: https://docs.llamaindex.ai/en/stable/examples/cookbooks/GraphRAG_v2/
@Buzeg You can take a look at this and change the prompt so it adapts better to Chinese. I have not tested it, but the idea is sound: the empty results come from the prompt and the LLM not being well adapted, so the keywords cannot exactly match the entities. If you do not know how to define the prompt, look at the API of the LLMSynonymRetriever class; it should just be a matter of passing a parameter to the constructor.
@likewindd Good idea!
Thank you very much!!! I will take a look.
Thanks! Got it running! @912100012
When building communities with index.property_graph_store.build_communities(), does it take quite a long time, around 4 or 5 minutes? @912100012 @likewindd
@Buzeg Correct. Take a look at this section and enable logging so you can see what the program is doing: https://docs.llamaindex.ai/en/stable/understanding/tracing_and_debugging/tracing_and_debugging/

Let me give you a brief explanation: after generating the community clusters, this function generates a summary for every community. That is an LLM call per community, so however many communities you have is how many times the model gets called, which is why it is a time-consuming step.
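A back-of-the-envelope sketch of why this step is slow, with assumed (made-up) numbers:

```python
# Assumed (made-up) numbers: one summarization LLM call per community
num_communities = 30
seconds_per_llm_call = 8  # assumed average latency per call

total_minutes = num_communities * seconds_per_llm_call / 60
print(total_minutes, "minutes")  # 4.0 minutes
```

With latencies in that ballpark, a few dozen communities easily add up to the 4-5 minutes reported above.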
Thank you very much for your answer! So this GraphRAG is actually not quite the same as Microsoft's, right? It not only generates a summary for every community; at query time llamaindex's queryengine also answers once for each relevant community and then aggregates those answers for the LLM to produce the final answer. (So it is equivalent to the local mode? And the global mode would be llamaindex's V1 GraphRAG?) But both the index + build-community stage and the query stage are slower than Microsoft's open-source GraphRAG. I also have another question: when I want to use GraphRAG to answer questions like "What story does this article mainly tell?", the relevant entities returned by get_entities are inaccurate. For this kind of question, is GraphRAG simply unsuitable? @912100012
@Buzeg I have not studied Microsoft's implementation closely, so I cannot say for sure whether your first case is local mode or something else. But in our discussions we did find that LlamaIndex's V2 diverges somewhat from Microsoft's design.

As for the second question, consider who your target users are. If they are "knowledge-rich" users in a specific domain (such as board secretaries, researchers, or writers), they know how to use the system, and you need not handle such abstract questions, because they will not ask them. If your target users are the general public, the system needs to generalize to every type of question. For this GraphRAG specifically, you would then need some extra processing when building communities: since the construction is clustered and hierarchical, a high-level question like that will be answered from the large, coarse-grained communities, giving you a coarse-grained answer. So it can be done.

In short, your system should not abandon a technique just because one query does not fit it.
Question Validation
Question
I already understand the principle of the GraphRAG implementation, but there is a method in the V2 version that I don't understand. For example, the retrieve method returns semantically relevant nodes, and those nodes must be native chunk text, so how does it match the regular expression below?

Well, isn't the match always an empty set?