microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License
18.67k stars 1.82k forks

[Issue]: Global Search takes a long time to respond for large datasets #928

Open andreiionut1411 opened 2 months ago

andreiionut1411 commented 2 months ago

Is there an existing issue for this?

Describe the issue

Dear Team, I am working with a CSV dataset that has 40k rows, with an average of 250 tokens per row. After indexing the dataset, I used the Global Search example and it took almost 5 minutes to get an answer back. I have also tried using the CLI for global querying and it answered in a bit less than 2 minutes, which is better but still slow. In both cases, I used community level 0, as any higher level would just take ages. I know that the larger the dataset, the longer it takes to iterate through the communities. However, the current response time is too long. Is there any way to speed up the Global Search query?

Steps to reproduce

No response

GraphRAG Config Used

No response

Logs and screenshots

No response

Additional Information

No response

COPILOT-WDP commented 2 months ago

Can you share some more info about your dataset and the time it took to ingest? I am struggling a bit with large datasets too.

andreiionut1411 commented 2 months ago

Can you share some more info about your dataset and the time it took to ingest? I am struggling a bit with large datasets too.

I am working with a dataset of 40k emails, which I added to a CSV file with one email per row. Ingestion took around 65 hours; the vast majority of that time was spent on LLM calls, while 1-2 hours were needed to create the graph and the other parquet files.

natoverse commented 2 months ago

I don't have any great suggestions for you at the moment. Global search is time consuming, because it summarizes every community to find the best answers to your question. You are already using level 0, which will be the fewest communities summarized.

We are investigating ways to rank the communities so you can set a threshold, but I don't have a concrete timeline for a real implementation.

If you think your domain/content can be filtered, you could filter down the create_final_communities.parquet rows based on criteria in there. You could do this as a post-indexing step if it can be static, or at runtime by implementing your own search context builder (see this comment: https://github.com/microsoft/graphrag/issues/917#issuecomment-2287126477)
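As a post-indexing step, the row filtering suggested above could look like the following minimal pandas sketch. The column names (`community`, `title`) are assumptions about the parquet schema rather than documented GraphRAG fields, so check them against your own output files.

```python
import pandas as pd

def filter_communities(df: pd.DataFrame, keyword: str) -> pd.DataFrame:
    """Keep only communities whose title mentions the keyword (case-insensitive)."""
    mask = df["title"].str.contains(keyword, case=False, na=False)
    return df[mask]

# Toy stand-in for rows loaded from create_final_communities.parquet:
communities = pd.DataFrame(
    {
        "community": [0, 1, 2],
        "title": ["Political donations", "Sports clubs", "Election coverage"],
    }
)

filtered = filter_communities(communities, "politic")
print(len(filtered))  # prints 1: only "Political donations" matches
```

In practice you would read the real file with `pd.read_parquet("create_final_communities.parquet")`, apply the filter, and write the reduced frame back before running the query.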

yuan-head commented 2 months ago

I don't have any great suggestions for you at the moment. Global search is time consuming, because it summarizes every community to find the best answers to your question. You are already using level 0, which will be the fewest communities summarized.

We are investigating ways to rank the communities so you can set a threshold, but I don't have a concrete timeline for a real implementation.

If you think your domain/content can be filtered, you could filter down the create_final_communities.parquet rows based on criteria in there. You could do this as a post-indexing step if it can be static, or at runtime by implementing your own search context builder (see this comment: #917 (comment))

Can I ask about methods that could help shorten the time taken by the search task?

mzh1996 commented 2 months ago

I have the same problem.

Is it possible to use some filters on the communities before the map stage? For example, only the communities sharing same entities with the query are kept. Or, we can only use the topK communities that are most similar to the query in the map stage, and the similarity can be measured by the cosine similarity between the report embedding and the query embedding.

(Just my personal suggestions. Hope the authors of this great project can provide better solutions)
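The topK idea sketched above (rank community reports by cosine similarity between the report embedding and the query embedding) could be prototyped with plain NumPy. This is a hypothetical helper, not part of GraphRAG's API:

```python
import numpy as np

def top_k_communities(report_embeddings: np.ndarray,
                      query_embedding: np.ndarray,
                      k: int) -> np.ndarray:
    """Return indices of the k reports most similar to the query, best first."""
    # Cosine similarity is the dot product of L2-normalized vectors.
    reports = report_embeddings / np.linalg.norm(
        report_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    sims = reports @ query
    return np.argsort(sims)[::-1][:k]

# Toy 2-D embeddings standing in for real report/query vectors:
reports = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.0])
print(top_k_communities(reports, query, 2))  # [0 2]
```

Only the selected reports would then be passed to the map stage, cutting the number of LLM calls from all communities at a level down to k.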

natoverse commented 2 months ago

I have the same problem.

Is it possible to use some filters on the communities before the map stage? For example, only the communities sharing same entities with the query are kept. Or, we can only use the topK communities that are most similar to the query in the map stage, and the similarity can be measured by the cosine similarity between the report embedding and the query embedding.

(Just my personal suggestions. Hope the authors of this great project can provide better solutions)

Yes, those could be helpful. There is a balance between the high-level thematic capabilities of global search versus specific questions. For example, if you ask a very broad question such as "what are the top themes in the dataset", it is very difficult to filter because semantic search should not get good hits. However, if you add small qualifiers such as "what are the top political themes in the dataset", you have now added a keyword that would allow semantic search to return ranked community summaries that discuss politics. We are investigating exactly this right now, and hope to have some solutions soon. Again, no firm timeline, but we're looking very closely at this.

disperaller commented 2 months ago

I have the same problem.

Is it possible to use some filters on the communities before the map stage? For example, only the communities sharing same entities with the query are kept. Or, we can only use the topK communities that are most similar to the query in the map stage, and the similarity can be measured by the cosine similarity between the report embedding and the query embedding.

(Just my personal suggestions. Hope the authors of this great project can provide better solutions)

I changed the code to implement the topK strategy, which sorts the community reports by occurrence_weight and selects the top K reports from the pool. This helps speed up the search process.
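A minimal sketch of that topK-by-weight selection, assuming the reports are loaded into a pandas DataFrame with an `occurrence_weight` column (the column name comes from the comment above; the rest of the schema is illustrative):

```python
import pandas as pd

def top_k_by_weight(reports: pd.DataFrame, k: int) -> pd.DataFrame:
    """Keep only the k community reports with the highest occurrence_weight."""
    return reports.sort_values("occurrence_weight", ascending=False).head(k)

# Toy stand-in for the community report pool:
reports = pd.DataFrame(
    {
        "community": [0, 1, 2, 3],
        "occurrence_weight": [0.2, 0.9, 0.5, 0.7],
    }
)

print(list(top_k_by_weight(reports, 2)["community"]))  # [1, 3]
```

Unlike the embedding-similarity approach, this ranking is query-independent, so it can be computed once after indexing rather than at search time.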