Do you need to file an issue?
[X] I have searched the existing issues and this bug is not already filed.
[ ] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
[ ] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
Describe the bug
Global and local search via the graphrag CLI do not work when Azure Blob Storage is used to store the indexing output. When I run a search from the CLI, it looks for 'create_final_nodes.parquet' on my local file system instead of in Azure Blob Storage, and errors out because it cannot locate that file.
Steps to reproduce
1) settings.yaml - I set 'blob' as the storage type and provided the connection string and container names (full file attached as settings - Copy.txt and pasted below)
2) I initialize and then build the index. Indexing completes successfully, and I can see the Azure containers populated with the output files expected from the run
3) I then run a local search OR a global search - it always errors out saying 'C:\create_final_nodes.parquet' could not be found (the commands are sketched below)
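For reference, the invocations look like this (project root and query text are placeholders; this assumes the standard graphrag module entry points from the Get Started docs):

python -m graphrag.index --init --root ./myproject
python -m graphrag.index --root ./myproject
python -m graphrag.query --root ./myproject --method local "my question"
python -m graphrag.query --root ./myproject --method global "my question"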
Expected Behavior
Global search and local search should have run successfully by reading the file(s) above from Azure Blob Storage.
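To confirm the artifacts really do exist in the output container (i.e., the failure is purely on the search read path), a minimal sketch along these lines can list the run's output and load the "missing" parquet directly from blob; the connection string and <timestamp> are placeholders, and the container name matches my storage config below:

import io

import pandas as pd
from azure.storage.blob import BlobServiceClient

# Same connection string as in settings.yaml (account key redacted).
conn_str = "DefaultEndpointsProtocol=https;AccountName=aistoragesvc;AccountKey=my-account-key;EndpointSuffix=core.windows.net"

service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("graphrag-ites-sow-output")

# List the run's artifacts in the output container.
for blob in container.list_blobs(name_starts_with="output/"):
    print(blob.name)

# Load the file the search CLI claims is missing, directly from blob.
blob_path = "output/<timestamp>/artifacts/create_final_nodes.parquet"  # real run timestamp goes here
data = container.download_blob(blob_path).readall()
print(pd.read_parquet(io.BytesIO(data)).head())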
GraphRAG Config Used
encoding_model: cl100k_base
skip_workflows: []

llm:
  api_key: 'my llm key'
  type: 'azure_openai_chat'
  model: 'gpt-4o'
  model_supports_json: true
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: 'https://my-aoai-endpoint.openai.azure.com/'
  api_version: 2024-02-15-preview
  # organization: <organization_id>
  deployment_name: 'gpt4-0'
  # tokens_per_minute: 150_000
  # requests_per_minute: 10_000
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: "${GRAPHRAG_API_KEY}"
    type: 'azure_openai_embedding'
    model: 'text-embedding-3-small'
    api_base: 'https://my-llm-end-point.openai.azure.com/'
    api_version: 2024-02-15-preview
    # organization: <organization_id>
    deployment_name: 'text-embedding-3-small'
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: blob
  base_dir: "cache"
  connection_string: 'DefaultEndpointsProtocol=https;AccountName=aistoragesvc;AccountKey=my-account-key;EndpointSuffix=core.windows.net'
  container_name: 'graphrag-ites-sow-cache'

storage:
  type: blob
  base_dir: "output/${timestamp}/artifacts"
  connection_string: 'DefaultEndpointsProtocol=https;AccountName=aistoragesvc;AccountKey=my-account-key;EndpointSuffix=core.windows.net'
  container_name: 'graphrag-ites-sow-output'

reporting:
  type: blob
  base_dir: "output/${timestamp}/reports"
  connection_string: 'DefaultEndpointsProtocol=https;AccountName=aistoragesvc;AccountKey=my-account-key;EndpointSuffix=core.windows.net'
  container_name: 'graphrag-ites-sow-reporting'

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
Logs and screenshots
No response
Additional Information
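A possible workaround until search can read from blob: mirror the run's artifacts to a local folder and point the query CLI at that copy. This is a minimal, untested sketch; <timestamp> and local_artifacts are placeholders, and the connection string and container name are the same as in the settings above:

import os

from azure.storage.blob import BlobServiceClient

# Same connection string as in settings.yaml (account key redacted).
conn_str = "DefaultEndpointsProtocol=https;AccountName=aistoragesvc;AccountKey=my-account-key;EndpointSuffix=core.windows.net"

service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("graphrag-ites-sow-output")

prefix = "output/<timestamp>/artifacts/"  # fill in the real run timestamp
local_dir = "local_artifacts"

# Mirror every artifact blob under the prefix into the local folder.
for blob in container.list_blobs(name_starts_with=prefix):
    rel = blob.name[len(prefix):]
    target = os.path.join(local_dir, *rel.split("/"))
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as f:
        f.write(container.download_blob(blob.name).readall())

With the artifacts local, the search can then be pointed at them (assuming the installed graphrag version supports the query module's --data flag):

python -m graphrag.query --data ./local_artifacts --method local "my question"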