microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Issue]: Does cache still work if extracting one-file graphRAG from a multiple files graphRAG? #819

Closed Edwin-poying closed 3 months ago

Edwin-poying commented 4 months ago

Hi, here is the scenario I am currently confronting:

I built a GraphRAG index based on two distinct .txt files. Later, I wanted to see whether I could build a GraphRAG index based on only one of them. After modifying the settings file to ensure that only one file gets ingested, I ran the following command: python -m graphrag.index --root .

I was expecting that this would not cost too much if the indexing stage could leverage the cache; however, it still makes complete calls to OpenAI to build the graph.

So, can someone tell me whether I did something wrong, or whether this scenario is not yet supported?

Many thanks.

Edwin-poying commented 4 months ago

Here is my settings.yml file:


encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: ${GRAPHRAG_LLM_MODEL}
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: ${GRAPHRAG_API_BASE}
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: ${GRAPHRAG_EMBEDDING_MODEL}
    api_base: ${API_BASE}
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*0731\\.txt$"
  # file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [product,group,job,feature,case,solution]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: True
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

natoverse commented 4 months ago

If your settings have not been changed, and the original file is still in the folder, then it should use the cache in several places. For example, the text units (chunks) should be identical, so graph extraction should use the cache for those. However, any new entities and relationships extracted from the second file will trigger a re-compute of the communities, and therefore all of the community summarization, which can account for much of your overall expense. We're tracking more efficient incremental indexing with #741.
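A rough way to see which steps benefit from the cache is to compare the contents of the cache folder before and after the second run. The sketch below assumes the file cache configured above (base_dir: "cache") and that responses are stored as one file per LLM call under per-step subfolders; the exact layout can vary between graphrag versions.

from pathlib import Path

def cache_inventory(root: str = "cache") -> dict[str, int]:
    """Count cached LLM responses per workflow subfolder."""
    return {
        sub.name: sum(1 for f in sub.iterdir() if f.is_file())
        for sub in sorted(Path(root).iterdir())
        if sub.is_dir()
    }

if __name__ == "__main__":
    # Run this once before re-indexing and once after; steps whose counts
    # do not grow were served entirely from the cache.
    for step, count in cache_inventory().items():
        print(f"{step}: {count} cached responses")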

Edwin-poying commented 4 months ago

Hi @natoverse, I have changed the file_pattern field in the input settings to target the specific file. Does this matter?

Below is how I changed it:

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  # file_pattern: ".*\\.txt$"
  file_pattern: ".*0731\\.txt$"
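As a quick sanity check on the pattern (the file names below are made up for illustration), the regex only matches files ending in 0731.txt:

import re

# Hypothetical file names, just to illustrate which inputs the pattern selects.
pattern = re.compile(r".*0731\.txt$")
print(bool(pattern.match("notes_0731.txt")))  # True  -> ingested
print(bool(pattern.match("notes_0801.txt")))  # False -> skipped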

natoverse commented 4 months ago

I don't think it should matter. The key to getting accurate cache hits is that we hash all of the LLM params and the prompt so that identical API calls are avoided. This is done per step, so individual parameter changes should only affect the steps that rely on them.
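A minimal sketch of that idea (illustrative only, not graphrag's actual implementation): the cache key is derived from everything that influences the LLM output, so an identical call hits the cache regardless of which input file the text came from, while changing any parameter or the prompt template invalidates it for the steps that use it.

import hashlib
import json

def cache_key(prompt: str, llm_params: dict) -> str:
    # Sort keys so logically identical parameter dicts hash to the same value.
    payload = json.dumps({"prompt": prompt, "params": llm_params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

params = {"model": "gpt-4-turbo", "temperature": 0, "max_tokens": 4000}
key_a = cache_key("Extract entities from: <chunk text>", params)
key_b = cache_key("Extract entities from: <chunk text>", params)
assert key_a == key_b   # identical call -> same key -> cached response reused

key_c = cache_key("Extract entities from: <chunk text>", {**params, "temperature": 0.7})
assert key_c != key_a   # any parameter change -> different key -> cache miss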

Edwin-poying commented 4 months ago

Thank you @natoverse for GraphRAG and for your answer.

I still have one question related to this topic:

I generated the GraphRAG index using two files at first, but later decided to build one using only one of them. I am wondering whether the system needs to regenerate the entity summaries, since the description list may change as a result of reducing the input documents. The same applies to the summaries of relationships and claims.

natoverse commented 4 months ago

The entity/relationship extraction step is separate from the summarization. During extraction, each entity and relationship instance is given a description by the LLM; this step gets the benefit of the cache. Before creating the community reports, the descriptions for each entity are combined into a single "canonical" description. This is also done by the LLM, so if the set of instances (and therefore the description list) for an entity changes, that step will not use the cache.
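A sketch of why that summarization step misses the cache when one of the two files is dropped (hypothetical entity and descriptions, not graphrag code): the prompt embeds the full description list, so when the list changes the prompt text, and therefore the cache key, changes with it.

def summarize_prompt(entity: str, descriptions: list[str]) -> str:
    # The prompt sent to the LLM embeds the full description list, so the
    # prompt text (and with it the cache key) changes whenever the list does.
    return f"Summarize the entity '{entity}' from:\n" + "\n".join(descriptions)

descriptions_two_files = [
    "ACME Router: mentioned in file A as a networking product.",
    "ACME Router: described in file B as supporting VPN passthrough.",
]
descriptions_one_file = descriptions_two_files[:1]

# Different description lists -> different prompts -> different cache keys.
print(summarize_prompt("ACME Router", descriptions_two_files)
      == summarize_prompt("ACME Router", descriptions_one_file))  # False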

Edwin-poying commented 3 months ago

Many thanks