microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

Consider adding an entity reconciliation step that merges nodes that seem to be duplicate #401

Open eyast opened 2 weeks ago

eyast commented 2 weeks ago

When running GraphRAG on a corpus such as the complete works of Sherlock Holmes, the generated graph contains separate nodes that should have been consolidated into one. For example, there are distinct nodes for:

Other nodes in the graph seem to be sparsely connected. For example, "Baker Street" has an edge to "Mr. Holmes" but to none of the other variants. I suspect this might lead to spurious cluster formations that affect downstream summarization. Should there be an optional step that attempts to reconcile these entities? I imagine there is no single blanket approach (I can think of many edge cases where the output above would be correct in another context), but maybe the user could be asked to choose whether she wants to 'fuse' the node 'Sherlock' with the node 'Sherlock Holmes', merging them into one?
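To make the 'fuse' idea concrete, here is a minimal sketch of what merging two nodes could look like. It is not GraphRAG code: it assumes the graph is held as a plain undirected adjacency map, `merge_nodes` is a hypothetical helper, and the edges in the toy data are made up for illustration.

```python
def merge_nodes(adjacency, keep, drop):
    """Return a new undirected adjacency map with `drop` fused into `keep`.

    `adjacency` maps each node name to the set of its neighbours.
    This is an illustrative helper, not part of GraphRAG.
    """
    merged = {}
    for node, neighbours in adjacency.items():
        source = keep if node == drop else node
        merged.setdefault(source, set())  # keep nodes that end up isolated
        for other in neighbours:
            target = keep if other == drop else other
            if source == target:
                continue  # drop self-loops created by the merge
            merged[source].add(target)
            merged.setdefault(target, set()).add(source)
    return merged


# Toy data: node names follow the variants discussed above, edges are invented.
graph = {
    "Sherlock": {"Watson"},
    "Sherlock Holmes": {"Moriarty"},
    "Mr. Holmes": {"Baker Street"},
    "Watson": {"Sherlock"},
    "Moriarty": {"Sherlock Holmes"},
    "Baker Street": {"Mr. Holmes"},
}

fused = merge_nodes(graph, keep="Sherlock Holmes", drop="Sherlock")
fused = merge_nodes(fused, keep="Sherlock Holmes", drop="Mr. Holmes")
```

After both merges, "Baker Street" is connected to the single "Sherlock Holmes" node, which also inherits the neighbours of the other two variants. A real implementation would additionally have to decide how to combine node descriptions and edge weights.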

For reference, output of the artifacts folder including graphs etc: https://www.dropbox.com/scl/fi/hfo1nppit6tfczrypc7tb/sherlock-holmes-artifacts.zip?rlkey=t3sx7g3q48tw5fl2eek6tg3la&dl=0

COPILOT-WDP commented 2 weeks ago

Which model have you used? GPT-4o?

The documentation mentions a "destructive entity resolution" step which is not enabled by default. I believe it is not implemented at all in the currently released code base (or at least I could not find it):

image

It should be possible to manually resolve entities and update the graph, but an optional best-effort approach would be great!
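As a rough idea of what a best-effort pass might do, the sketch below only suggests candidate pairs for a human to confirm or reject; it does not merge anything. Everything in it is an assumption for illustration: the `HONORIFICS` list, the helper names, and the token-subset heuristic are not taken from GraphRAG or its documentation.

```python
import re
from itertools import combinations

# Assumed, deliberately short list of titles to ignore when comparing names.
HONORIFICS = {"mr", "mrs", "miss", "dr", "sir"}


def name_tokens(name):
    """Lower-case a name, split it into words, and drop honorifics."""
    words = re.findall(r"[a-z]+", name.lower())
    return frozenset(w for w in words if w not in HONORIFICS)


def candidate_duplicates(names):
    """Return pairs whose token sets are equal or nested, for manual review."""
    pairs = []
    for a, b in combinations(sorted(names), 2):
        ta, tb = name_tokens(a), name_tokens(b)
        if ta and tb and (ta <= tb or tb <= ta):
            pairs.append((a, b))
    return pairs


names = ["Sherlock", "Sherlock Holmes", "Mr. Holmes", "Baker Street"]
suggestions = candidate_duplicates(names)
for a, b in suggestions:
    print(f"Possible duplicate: {a!r} / {b!r}")
```

A heuristic like this will also flag two different characters who happen to share a surname, so the suggestions should be treated as prompts for the user rather than applied automatically.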