microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

Incremental indexing (adding new content) #741

Open natoverse opened 1 month ago

natoverse commented 1 month ago

A number of users have asked how to add new content to an existing index without needing to re-run the entire process. This is a feature we are planning; we are currently in the design stage to make sure we take an efficient approach.

As it stands, new content can be added to a GraphRAG index without requiring a complete re-index. This is because we rely heavily on a cache to avoid repeating the same calls to the model API. There are several stages within the pipeline for which this is very efficient - namely those stages that are atomic and do not have upstream dependencies. For example, if you add new documents to a folder, we will not re-chunk existing documents or perform entity and relationship extraction on the existing content; we will simply fetch the processed content from the cache and pass it on. The new documents will be processed and new entities and relationships extracted. Downstream of this, the graph construction process will need to recreate the graph to include the new nodes and edges, and communities will be recomputed - resulting in re-summarization, etc. You can get a sense of this process and what downstream steps may be re-processed by looking at the data flow diagram.
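To make the caching behavior concrete, here is a minimal sketch of a content-addressed LLM cache, assuming a simple dict-backed store. It is illustrative only, not the actual graphrag implementation: the key is derived from the chunk's prompt plus the model parameters, so unchanged chunks are served from the cache and never re-sent to the model API (and, conversely, changing the model settings changes the keys).

```python
import hashlib
import json

# Illustrative sketch of a content-addressed LLM cache (not the actual
# graphrag code). The key covers the prompt text and the model parameters,
# so previously processed chunks hit the cache, while changed llm settings
# produce new keys and effectively invalidate it.
def cache_key(prompt: str, llm_params: dict) -> str:
    payload = json.dumps({"prompt": prompt, "params": llm_params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def extract_entities(prompt: str, llm_params: dict, cache: dict, call_llm):
    key = cache_key(prompt, llm_params)
    if key in cache:                 # previously processed chunk: reuse result
        return cache[key]
    result = call_llm(prompt, **llm_params)  # only new chunks reach the model
    cache[key] = result
    return result
```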

Describe the solution you'd like

An ideal solution would be to add a new command to GraphRAG such as update that can be run against new data to augment an existing index. Considerations here include evaluating the new entities to determine whether they can be added to an existing community, and deciding when those communities have been altered enough to constitute a "drift" that needs recomputing. We could also perform analysis to determine which communities have been edited, so that we can skip summarization for those that haven't changed.
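As a rough illustration of the "drift" idea, here is a hedged sketch of a check that re-summarizes a community only when enough of its members are new or changed. The function name, the data shapes, and the 0.2 threshold are assumptions made for illustration; none of this is part of graphrag.

```python
# Hypothetical drift check: re-summarize a community only when the fraction
# of new/changed nodes exceeds a threshold (0.2 chosen arbitrarily here).
def needs_resummarization(community_nodes: set[str],
                          new_or_changed_nodes: set[str],
                          drift_threshold: float = 0.2) -> bool:
    if not community_nodes:
        return True
    touched = len(community_nodes & new_or_changed_nodes)
    return touched / len(community_nodes) > drift_threshold

communities = {"c1": {"a", "b", "c", "d"}, "c2": {"e", "f"}}
new_nodes = {"c", "g"}
to_update = [cid for cid, nodes in communities.items()
             if needs_resummarization(nodes, new_nodes)]
print(to_update)  # ['c1'] - only the community whose membership drifted enough
```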

Additional context

We also need to consider the types of analysis incremental ingest can enable beyond just "updates". For example, daily ingest of news with thoughtful graph construction/annotation could allow for delta analysis, so that questions like "what happened with person x in the last 24 hours" or "catch me up on the news themes this week" can be answered.
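As a hedged sketch of what such delta-style queries might build on (this is not something graphrag does today, and the field names are invented for illustration), each ingested chunk could be tagged with an ingest timestamp and filtered by a time window before retrieval:

```python
from datetime import datetime, timedelta, timezone

# Illustrative only: tag chunks with an ingest timestamp so "last 24 hours"
# style questions can filter to recently added content before retrieval.
chunks = [
    {"text": "Person X announced ...", "ingested_at": datetime.now(timezone.utc)},
    {"text": "Older background piece",
     "ingested_at": datetime.now(timezone.utc) - timedelta(days=30)},
]

def recent_chunks(chunks, window=timedelta(hours=24)):
    cutoff = datetime.now(timezone.utc) - window
    return [c for c in chunks if c["ingested_at"] >= cutoff]

print(len(recent_chunks(chunks)))  # 1: only the chunk added within the window
```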

Some desired types of editing users have described in other issues:

Scope

For now we are going to limit the scope of this feature to just an incremental index update to append content, and not worry about removal, manual graph editing, or the metadata tagging that would be required to do delta-style queries.

Approach

Putting here a little more detail on the approach we've discussed. It largely echoes what I put above as ideas, but I'll repeat for clarity:

natoverse commented 1 month ago

This is a popular request, so I'm going to pin it and route other issues here.

natoverse commented 1 month ago

Related: removing existing content, e.g., #585

KylinMountain commented 1 month ago

@natoverse can we split the index into separate graph-build and community-summary stages? Many users have asked me whether the knowledge graph can be modified manually, since sometimes an entity or relationship is wrong.

KylinMountain commented 1 month ago

If you modify the llm params in settings.yaml, all of the cache will be invalidated.

natoverse commented 1 month ago

Additional use case: adding files of a different type: https://github.com/microsoft/graphrag/issues/784

ljhskyso commented 1 month ago

Any ETA on this feature? Need this to assess whether I need to implement my own solution. @natoverse

shaoqing404 commented 1 month ago

I think it is more urgent to change the cache from files to Milvus (or other vector DBs). IO on relatively large files is the bottleneck that most affects the overall query latency of graphrag.

vishyarjun commented 2 weeks ago

There are two distinct scenarios for adding documents to the index, each requiring a different approach to community management and querying:

Scenario 1: Siloed Document Communities

Scenario 2: Unified Document Collection

gusye1234 commented 1 week ago

Hi, maybe check out this repo; it supports incremental insert for entities and relationships. It also computes the hash of the docs, so only the new docs are inserted each time you insert.

shandianshan commented 3 days ago

@gusye1234 Hi, thank you for sharing your work. Could you please explain how you addressed the issue of integrating new entities into the community?

gusye1234 commented 1 day ago

> @gusye1234 Hi, thank you for sharing your work. Could you please explain how you addressed the issue of integrating new entities into the community?

Sure. nano-graphrag uses the md5 hash of docs and chunks as their keys. When an insert begins, docs and chunks that already exist are ignored, and only the new chunks continue through the insert. nano-graphrag automatically loads the previous graph from the working dir, and the new entities and relationships are added to the current graph.

However, every time you insert, nano-graphrag will still re-compute the communities and generate new community reports. Incremental update of communities is not yet implemented.
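For reference, here is a rough sketch of the dedup step described above, assuming an in-memory set of seen hashes; it is not the actual nano-graphrag code, just an illustration of keying docs/chunks by their md5 hash.

```python
import hashlib

# Rough sketch of hash-based dedup (not the actual nano-graphrag code):
# chunks are keyed by md5, so re-inserting the same content is skipped and
# only new chunks go on to entity/relationship extraction.
def md5_key(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def filter_new_chunks(chunks: list[str], existing_keys: set[str]) -> list[str]:
    new_chunks = []
    for chunk in chunks:
        key = md5_key(chunk)
        if key not in existing_keys:   # unseen content: keep for extraction
            existing_keys.add(key)
            new_chunks.append(chunk)
    return new_chunks

seen: set[str] = set()
print(filter_new_chunks(["doc A", "doc B"], seen))  # both are new
print(filter_new_chunks(["doc A", "doc C"], seen))  # only "doc C" remains
```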

Tipik1n commented 1 day ago

The lack of this feature is also holding me back (and probably many others) from fully committing to this repo as a solution in the gen AI space. Would love to see how you implement this.

yaroslavyaroslav commented 1 day ago

> Hi, maybe check out this repo,

@gusye1234 If you changed this link by removing the branch specification, it would open as a GitHub repo within the GitHub mobile app, which I suppose would increase the number of stars by making it easier for mobile users to star it. Right now it just opens the working directory instead.