Do you need to file an issue?
[X] I have searched the existing issues and this feature is not already filed.
[ ] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
[X] I believe this is a legitimate feature request, not just a question. If this is a question, please use the Discussions area.
Is your feature request related to a problem? Please describe.
Duplicate or closely related entities are present in the knowledge graph after the indexing phase, decreasing the semantic and structural quality of the graph.
During indexing, the current graph extraction process prompts an LLM up to _max_gleanings times to extract entities and relationships, as defined in the GraphExtractor class. This iterative approach works well for increasing the number of entities extracted from the input document, as shown in Figure 2 of the GraphRAG paper (see below). However, in my usage it also produces several duplicate entities that refer to the same real-world concept or entity, yielding a noisy knowledge graph that is not as actionable as it could be.
Describe the solution you'd like
Add coreference resolution to graph extraction during the indexing phase.
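To make the idea concrete, here is a minimal sketch of what such a post-extraction step might look like. It is not GraphRAG code: the resolve_entities helper, the shape of the entity and relationship records, and the 0.9 threshold are all assumptions made up for illustration. Plain string similarity (Python's difflib) stands in for a real coreference model, so it would only catch surface-level duplicates such as casing or punctuation variants.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name)
    return " ".join(cleaned.lower().split())


def resolve_entities(entities, relationships, threshold=0.9):
    """Merge same-type entities whose normalized names are near-identical.

    Hypothetical record shapes (not GraphRAG's internal format):
      entities:      [{"name": str, "type": str, "description": str}, ...]
      relationships: [(source_name, target_name, description), ...]
    Returns (merged_entities, rewritten_relationships).
    """
    canonical = {}  # original name -> canonical name
    merged = []     # canonical entity records, in first-seen order
    for ent in entities:
        key = normalize(ent["name"])
        match = None
        for cand in merged:
            score = SequenceMatcher(None, key, normalize(cand["name"])).ratio()
            if cand["type"] == ent["type"] and score >= threshold:
                match = cand
                break
        if match is None:
            match = {"name": ent["name"], "type": ent["type"], "descriptions": []}
            merged.append(match)
        # Keep every distinct description so no extracted detail is lost.
        if ent["description"] not in match["descriptions"]:
            match["descriptions"].append(ent["description"])
        canonical[ent["name"]] = match["name"]

    # Point relationships at canonical names; drop self-loops and exact repeats.
    rewritten = []
    for source, target, description in relationships:
        edge = (canonical.get(source, source), canonical.get(target, target), description)
        if edge[0] != edge[1] and edge not in rewritten:
            rewritten.append(edge)
    return merged, rewritten


# Illustrative input, as two gleaning passes might produce it.
entities = [
    {"name": "Microsoft Research", "type": "ORGANIZATION", "description": "Research division."},
    {"name": "MICROSOFT RESEARCH", "type": "ORGANIZATION", "description": "Publisher of the paper."},
    {"name": "GraphRAG", "type": "METHOD", "description": "Graph-based RAG approach."},
]
relationships = [
    ("MICROSOFT RESEARCH", "GraphRAG", "developed"),
    ("Microsoft Research", "GraphRAG", "developed"),
]
merged, edges = resolve_entities(entities, relationships)
```

In this example the two spellings of the organization collapse into one node that carries both descriptions, and the two edges collapse into one. A real implementation would likely need an embedding- or LLM-based comparison to catch aliases and abbreviations that string similarity misses.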
In the GraphExtractor class, once we perform the _process_document loop, we could add a step of coreference resolution on the extracted entities before we proceed to build the networkx graph.
Additional context
No response