microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

Performance issue with GraphRag on large file processing (7GB) – slow load time and verb function not being triggered #1277

Open 9prodhi opened 1 month ago

9prodhi commented 1 month ago

I am using GraphRag to process a large file (~7GB). Processing works fine for smaller files (in the MB range), but the workflow experiences significant delays with the larger file: the file takes a long time to load, and after more than an hour the workflow still has not reached the verb's execution.

Here are the details of the issue:

Small File Processing:

Large File Processing:

Although the verb function is not being called for larger files yet, I would also like to ask about optimizing performance for large file processing. Here's the relevant code snippet I am using:

import logging
from enum import Enum
from typing import Any, cast
import pandas as pd
import io
from datashaper import (
    AsyncType,
    TableContainer,
    VerbCallbacks,
    VerbInput,
    derive_from_rows,
    verb,
)
from graphrag.index.bootstrap import bootstrap
from graphrag.index.cache import PipelineCache
from graphrag.index.storage import PipelineStorage
from graphrag.index.llm import load_llm
from graphrag.llm import CompletionLLM
from graphrag.config.enums import LLMType

@verb(name="nomic_embed")
async def nomic_embed(
    input: VerbInput,
    cache: PipelineCache,
    storage: PipelineStorage,
    callbacks: VerbCallbacks,
    column: str,
    id_column: str,
    to: str,
    async_mode: AsyncType = AsyncType.AsyncIO,
    num_threads: int = 108,
    batch_size: int = 150000,
    output_file: str = "embed_results.parquet",
    **kwargs,
):
    # (verb body omitted in the original post)
    ...

I am using the num_threads and batch_size parameters to parallelize the nomic_embed verb and reduce processing time for large files.

Is there a recommended approach, or are there additional parameters I should consider, for processing large files with GraphRag?

PassStory commented 4 weeks ago

When building the graph, the most time-consuming part seems to be the LLM calls. Even though the code already uses asynchronous methods, the time cost is still significant. I tried to modify the code to use batch mode for the LLM, but the data flows through multiple layers of API calls, which makes this difficult to implement. I'm curious whether the data sizes the authors used in their experiments were only laboratory-scale.
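To illustrate the kind of concurrency being discussed: a minimal sketch of running many LLM requests concurrently with a cap on in-flight calls. The llm_call function here is a hypothetical stand-in, not a GraphRag or datashaper API:

```python
import asyncio

async def llm_call(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM request
    await asyncio.sleep(0.01)
    return f"response:{prompt}"

async def batched_llm_calls(prompts, max_concurrency: int = 8):
    """Run LLM requests concurrently, capped by a semaphore."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(p):
        async with sem:
            return await llm_call(p)

    # gather preserves input order in its results
    return await asyncio.gather(*(guarded(p) for p in prompts))
```

This limits concurrency without true request batching; genuine batch endpoints would still require restructuring the layered API calls mentioned above.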