Closed · Edwin-poying closed this issue 3 months ago
Here is my settings.yml file:
```yaml
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: ${GRAPHRAG_LLM_MODEL}
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: ${GRAPHRAG_API_BASE}
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: ${GRAPHRAG_EMBEDDING_MODEL}
    api_base: ${API_BASE}
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*0731\\.txt$"
  # file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [product,group,job,feature,case,solution]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: True
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
```
If your settings have not been changed, and the original file is still in the folder, then it should use the cache in several places. For example, the text units (chunks) should be identical, so graph extraction should use the cache for those. However, any new entities and relationships extracted from the second file will trigger a re-compute of the communities, and therefore all of the community summarization, which can account for much of your overall expense. We're tracking more efficient incremental indexing in #741.
Hi @natoverse, I have changed the file_pattern field in the input settings so that only the specific file is picked up. Does this matter?
Below is how I changed it:

```yaml
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  # file_pattern: ".*\\.txt$"
  file_pattern: ".*0731\\.txt$"
```
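For anyone checking their own pattern, here is a quick way to see which filenames a given file_pattern would pick up. This is only an illustration of the regular expression itself; the filenames are made up:

```python
import re

# The pattern from the modified settings: only files ending in "0731.txt" match.
pattern = re.compile(r".*0731\.txt$")

# Hypothetical example filenames, purely for illustration.
for name in ["report_0731.txt", "report_0801.txt", "notes0731.txt", "summary.txt"]:
    print(f"{name}: {bool(pattern.match(name))}")
# report_0731.txt: True
# report_0801.txt: False
# notes0731.txt: True
# summary.txt: False
```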
I don't think it should matter - the key to getting an accurate cache is that we hash all of the LLM params and prompt so that identical API calls are avoided. This is done per step, so individual parameter changes should only affect the steps that rely on them.
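As a rough illustration of that hashing idea, here is a simplified sketch; it is not GraphRAG's actual cache implementation, and the function name and key layout are assumptions:

```python
import hashlib
import json

def cache_key(prompt: str, llm_params: dict) -> str:
    """Hash the prompt together with the LLM parameters.

    Any change to the prompt text or to a parameter such as the model name
    or max_tokens produces a different key, and therefore a cache miss.
    """
    payload = json.dumps({"prompt": prompt, "params": llm_params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

params = {"model": "gpt-4-turbo", "temperature": 0}
k1 = cache_key("Extract entities from: <chunk text>", params)
k2 = cache_key("Extract entities from: <chunk text>", params)
k3 = cache_key("Extract entities from: <different chunk>", params)
assert k1 == k2  # identical prompt and params -> served from cache
assert k1 != k3  # different chunk text -> new API call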
Thank you @natoverse for GraphRAG and for your answer.
I still have one question related to this topic:
I originally built the index from two files, but later decided to build it from only one of them. I am wondering whether the system needs to regenerate the entity summaries, since the description lists may change when an input document is removed. The same applies to the summaries of relationships and claims.
The entity/relationship extraction step is separate from the summarization. During extraction, each entity and relationship instance is given a description by the LLM; this step gets the benefit of the cache. Before creating the community reports, the descriptions for each entity are combined into a single "canonical" description. This is also done by the LLM, and if you have new (or fewer) instances of an entity, that call will not hit the cache.
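To make that last point concrete: the summarization prompt is built from the list of per-instance descriptions, so removing a document changes that list, which changes the prompt and therefore the hash used for the cache lookup. A simplified, hypothetical sketch (the descriptions and prompt wording below are invented for illustration):

```python
import hashlib
import json

def cache_key(prompt: str, params: dict) -> str:
    # Same idea as the sketch above: hash the prompt plus the LLM parameters.
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hypothetical per-instance descriptions for one entity, one per source document.
descriptions_two_files = [
    "ACME Router: a networking product mentioned in file A.",
    "ACME Router: end-of-life hardware according to file B.",
]
descriptions_one_file = descriptions_two_files[:1]  # second file removed

params = {"model": "gpt-4-turbo", "temperature": 0}
prompt_two = "Summarize these descriptions:\n" + "\n".join(descriptions_two_files)
prompt_one = "Summarize these descriptions:\n" + "\n".join(descriptions_one_file)

# Different description lists produce different prompts, hence different cache
# keys, so the summarization step has to call the LLM again for that entity.
assert cache_key(prompt_two, params) != cache_key(prompt_one, params)
```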
Many thanks
Hi, here is the scenario I am currently facing:
I built a GraphRAG index from two distinct .txt files. Later, I wanted to see if I could build an index from just one of them. After modifying the settings file so that only one file gets ingested, I ran the following command:
python -m graphrag.index --root .
I was expecting this run not to cost much, since the indexing stage should be able to leverage the cache; however, it still makes full calls to OpenAI to build the graph.
So, can someone tell me whether I did something wrong, or whether this scenario is not supported yet?
Many thanks.