microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License
19.41k stars, 1.92k forks

[Feature Request]: Improved coreference resolution when building knowledge graph #1244

Open fpaupier opened 1 month ago

fpaupier commented 1 month ago

Is your feature request related to a problem? Please describe.

Duplicate or closely related entities remain in the knowledge graph after the indexing phase, which degrades the semantic and structural quality of the graph.


During indexing, the current graph extraction process prompts an LLM up to _max_gleanings times to extract entities and relationships, as defined in the GraphExtractor class. This iterative approach works well to increase the number of entities extracted from the input document, as shown in Figure 2 of the GraphRAG paper (see below). In my usage, however, it also produces several duplicate entities that refer to the same real-world concept or entity, yielding a noisy knowledge graph that is not as actionable as it could be.

[Image: screenshot of Figure 2 from the GraphRAG paper]

Describe the solution you'd like

Add a coreference resolution step to graph extraction during the indexing phase.


In the GraphExtractor class, once the _process_document loop has run, we could add a coreference resolution step over the extracted entities before building the networkx graph.
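
To make the idea concrete, here is a minimal sketch of what such a step might look like. It is not GraphRAG code: the function names (normalize, resolve_entities, merge_relationships), the (source, target, weight) tuple shape, and the 0.85 threshold are all assumptions made for illustration. It also swaps real coreference resolution for plain string similarity from the standard library's difflib, so it only catches surface variants of a name. An embedding-based or LLM-based resolver could replace the matching rule without changing the overall shape.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, turn punctuation into spaces, collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name)
    return " ".join(cleaned.lower().split())


def resolve_entities(names, threshold=0.85):
    """Map every extracted name to a canonical name (first-seen spelling wins).

    Two names are treated as the same entity when their normalized forms are
    equal or their similarity ratio reaches `threshold` (hypothetical default).
    """
    canonical = []  # (normalized key, original spelling) of each kept entity
    mapping = {}
    for name in names:
        key = normalize(name)
        match = None
        for seen_key, seen_name in canonical:
            if key == seen_key or SequenceMatcher(None, key, seen_key).ratio() >= threshold:
                match = seen_name
                break
        if match is None:
            canonical.append((key, name))
            match = name
        mapping[name] = match
    return mapping


def merge_relationships(relationships, mapping):
    """Rewrite edge endpoints to canonical names.

    Edges that collapse into self-loops are dropped, and weights of edges that
    become identical after the rewrite are summed.
    """
    merged = {}
    for source, target, weight in relationships:
        s = mapping.get(source, source)
        t = mapping.get(target, target)
        if s == t:
            continue
        merged[(s, t)] = merged.get((s, t), 0.0) + weight
    return merged


# Made-up sample data standing in for the output of the extraction loop.
names = ["Microsoft", "MICROSOFT", "microsoft.", "OpenAI", "Open AI", "Google"]
relationships = [
    ("Microsoft", "OpenAI", 1.0),
    ("MICROSOFT", "Open AI", 2.0),
    ("Microsoft", "microsoft.", 1.0),
    ("Google", "OpenAI", 1.0),
]

mapping = resolve_entities(names)
edges = merge_relationships(relationships, mapping)
print(sorted(set(mapping.values())))
print(edges)
```

The merged edges could then be passed to the graph-building step in place of the raw extraction output. String similarity is only a cheap baseline: it would not link "Microsoft" to "MSFT" or "the Redmond company", which is where a proper coreference model would be needed.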

Additional context

No response

fpaupier commented 1 month ago

Hi @AlonsoGuevara, I can provide a PR on this topic. Would the project be open to merging it if proposed?