microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Bug]: Search fails with path errors with custom storage and reporting paths #599

Closed TheBugKing closed 4 days ago

TheBugKing commented 1 month ago

Describe the bug

I am trying to keep a single folder where the indexed artifacts for my documents are stored. I do not want a new timestamped folder per indexing run, because when I add new documents a new folder is generated, and at query time I no longer know which index artifacts to point to and load for the response.

Per discussion #354, we can add new documents and rerun the indexer, which adds the new data to the summaries. I therefore changed the storage and reporting paths, which led to the issues below.

Specifically, I changed from:

```yaml
base_dir: "output/${timestamp}/artifacts"
```

to:

```yaml
storage:
  type: file # or blob
  base_dir: "output/files/artifacts"

reporting:
  type: file # or console, blob
  base_dir: "output/files/reports"
```

Issues:

  1. The indexing-engine logs are still generated under a timestamp folder, which causes problems later when performing local and global searches.

  2. Case A: global and local searches fail because the correct path cannot be located or loaded. With storage `base_dir: "output/artifacts"`, the returned path appears to be output/artifacts/artifact, which does not exist at all. The path is inferred here:

     ```python
     def _infer_data_dir(root: str) -> str:
         output = Path(root) / "output"
         # use the latest data-run folder
         if output.exists():
             folders = sorted(output.iterdir(), key=os.path.getmtime, reverse=True)
             if len(folders) > 0:
                 folder = folders[0]
                 return str((folder / "artifacts").absolute())
         msg = f"Could not infer data directory from root={root}"
         raise ValueError(msg)
     ```

Case B: with storage `base_dir: "output/files/artifacts"` and multiple indexing runs, indexing-engine.log is still generated under the timestamp folders. Since the logic always sorts folders by last-modified time, the most recently modified timestamp folder wins over output/files; the resulting sort order is visible in the logs below.

Result: local and global searches error out due to path issues.

Note: I am running my LLM models locally with LM Studio and Ollama to save costs.
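For illustration, here is a minimal sketch of a more tolerant inference: it still walks run folders newest-first, but only accepts one that actually contains an artifacts directory, so timestamp folders holding nothing but indexing-engine.log would be skipped. This is only a sketch of a possible fix (`_infer_data_dir_sketch` is a hypothetical name), not the shipped implementation:

```python
import os
from pathlib import Path


def _infer_data_dir_sketch(root: str) -> str:
    """Pick the newest run folder that actually holds artifacts (sketch)."""
    output = Path(root) / "output"
    if output.exists():
        for folder in sorted(output.iterdir(), key=os.path.getmtime, reverse=True):
            artifacts = folder / "artifacts"
            # skip timestamp folders that only contain logs/reports
            if artifacts.is_dir():
                return str(artifacts.absolute())
    msg = f"Could not infer data directory from root={root}"
    raise ValueError(msg)
```

With the layout above, this would resolve to output/files/artifacts even when a timestamp folder was modified more recently.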

Steps to reproduce

  1. Modify the paths from `base_dir: "output/${timestamp}/artifacts"` to:

     ```yaml
     storage:
       type: file # or blob
       base_dir: "output/files/artifacts"

     reporting:
       type: file # or console, blob
       base_dir: "output/files/reports"
     ```

  2. Run the indexer two or more times.

  3. Run a global search.

Expected Behavior

  1. All reports and artifacts should be generated in the specified paths.
  2. Paths should be loaded exactly as modified or specified in the settings, whether or not they include timestamps.
  3. The end goal is to add new documents on top of existing indexed data without re-indexing everything. If timestamp-based output artifacts must be used, how can one query across both the old and new indexed data? (See the sketch after this list.)
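As a stopgap while the configured paths are not honored end to end, the query side can be pointed at the configured artifacts folder explicitly instead of letting it be inferred from folder timestamps. A minimal sketch, assuming settings.yaml sits under the `--root` directory and that `storage.base_dir` is relative to it:

```python
from pathlib import Path

import yaml

# Resolve the artifacts directory from settings.yaml instead of inferring it
# from folder modification times (paths here match the repro above).
root = Path("ragtest")
settings = yaml.safe_load((root / "settings.yaml").read_text(encoding="utf-8"))
data_dir = root / settings["storage"]["base_dir"]  # e.g. ragtest/output/files/artifacts

if not (data_dir / "create_final_nodes.parquet").exists():
    raise FileNotFoundError(f"no artifacts found under {data_dir}")
print(f"point the query at: {data_dir}")
```

If your version of the query CLI accepts an explicit data-directory argument, passing this resolved path should bypass `_infer_data_dir` entirely.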

GraphRAG Config Used

```yaml
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: mistral
  model_supports_json: true # recommended if this is available for your model.

parallelization:
  stagger: 0.3

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf

chunks:
  size: 400
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"

storage:
  type: file # or blob
  base_dir: "output/files/artifacts"

reporting:
  type: file # or console, blob
  base_dir: "output/files/reports"

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 0

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 200

claim_extraction:
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  prompt: "prompts/community_report.txt"
  max_length: 1000
  max_input_length: 3000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  conversation_history_max_turns: 5
  top_k_mapped_entities: 10
  top_k_relationships: 10
  max_tokens: 12000

global_search:
  max_tokens: 12000
  data_max_tokens: 12000
  map_max_tokens: 1000
  reduce_max_tokens: 2000
  concurrency: 32
```

Logs and screenshots


Logs with some additional print statements:

```
PS D:\WORK\PROJECTS\Python-POC\GraphRAG> python -m graphrag.query --root ./ragtest --method global "Who is antariksh"

args: None
data dir is NONE
** folders: [WindowsPath('ragtest/output/20240717-141348'), WindowsPath('ragtest/output/20240717-131540'), WindowsPath('ragtest/output/files'), WindowsPath('ragtest/output/20240717-131330')]

INFO: Reading settings from ragtest\settings.yaml
data_dir D:\WORK\PROJECTS\Python-POC\GraphRAG\ragtest\output\20240717-141348\artifacts -> invalid path
root_dir ./ragtest
*config {
  "llm": { "api_key": "", "type": "openai_chat", "model": "mistral", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "http://127.0.0.1:11434/v1", "api_version": null, "organization": null, "proxy": null, "cognitive_services_endpoint": null, "deployment_name": null, "model_supports_json": true, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 },
  "parallelization": { "stagger": 0.3, "num_threads": 50 },
  "async_mode": "threaded",
  "root_dir": "./ragtest",
  "reporting": { "type": "file", "base_dir": "output/files/reports", "connection_string": null, "container_name": null, "storage_account_blob_url": null },
  "storage": { "type": "file", "base_dir": "output/files/artifacts", "connection_string": null, "container_name": null, "storage_account_blob_url": null },
  "cache": { "type": "file", "base_dir": "cache", "connection_string": null, "container_name": null, "storage_account_blob_url": null },
  "input": { "type": "file", "file_type": "text", "base_dir": "input", "connection_string": null, "storage_account_blob_url": null, "container_name": null, "encoding": "utf-8", "file_pattern": ".*\.txt$", "file_filter": null, "source_column": null, "timestamp_column": null, "timestamp_format": null, "text_column": "text", "title_column": null, "document_attribute_columns": [] },
  "embed_graph": { "enabled": false, "num_walks": 10, "walk_length": 40, "window_size": 2, "iterations": 3, "random_seed": 597832, "strategy": null },
  "embeddings": { "llm": { "api_key": "", "type": "openai_embedding", "model": "nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "http://localhost:1234/v1", "api_version": null, "organization": null, "proxy": null, "cognitive_services_endpoint": null, "deployment_name": null, "model_supports_json": null, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "batch_size": 16, "batch_max_tokens": 8191, "target": "required", "skip": [], "vector_store": null, "strategy": null },
  "chunks": { "size": 400, "overlap": 100, "group_by_columns": [ "id" ], "strategy": null },
  "snapshots": { "graphml": false, "raw_entities": false, "top_level_nodes": false },
  "entity_extraction": { "llm": { "api_key": "", "type": "openai_chat", "model": "mistral", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "http://127.0.0.1:11434/v1", "api_version": null, "organization": null, "proxy": null, "cognitive_services_endpoint": null, "deployment_name": null, "model_supports_json": true, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "prompt": "prompts/entity_extraction.txt", "entity_types": [ "organization", "person", "geo", "event" ], "max_gleanings": 0, "strategy": null },
  "summarize_descriptions": { "llm": { "api_key": "", "type": "openai_chat", "model": "mistral", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "http://127.0.0.1:11434/v1", "api_version": null, "organization": null, "proxy": null, "cognitive_services_endpoint": null, "deployment_name": null, "model_supports_json": true, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "prompt": "prompts/summarize_descriptions.txt", "max_length": 200, "strategy": null },
  "community_reports": { "llm": { "api_key": "", "type": "openai_chat", "model": "mistral", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "http://127.0.0.1:11434/v1", "api_version": null, "organization": null, "proxy": null, "cognitive_services_endpoint": null, "deployment_name": null, "model_supports_json": true, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "prompt": null, "max_length": 2000, "max_input_length": 8000, "strategy": null },
  "claim_extraction": { "llm": { "api_key": "", "type": "openai_chat", "model": "mistral", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "http://127.0.0.1:11434/v1", "api_version": null, "organization": null, "proxy": null, "cognitive_services_endpoint": null, "deployment_name": null, "model_supports_json": true, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "enabled": false, "prompt": "prompts/claim_extraction.txt", "description": "Any claims or facts that could be relevant to information discovery.", "max_gleanings": 0, "strategy": null },
  "cluster_graph": { "max_cluster_size": 10, "strategy": null },
  "umap": { "enabled": false },
  "local_search": { "text_unit_prop": 0.5, "community_prop": 0.1, "conversation_history_max_turns": 5, "top_k_entities": 10, "top_k_relationships": 10, "max_tokens": 12000, "llm_max_tokens": 2000 },
  "global_search": { "max_tokens": 12000, "data_max_tokens": 12000, "map_max_tokens": 1000, "reduce_max_tokens": 2000, "concurrency": 32 },
  "encoding_model": "cl100k_base",
  "skip_workflows": []
}

Traceback (most recent call last):
  File "C:\Users\GodSpeed\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\GodSpeed\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "d:\WORK\PROJECTS\Python-POC\GraphRAG\.venv\lib\site-packages\graphrag\query\__main__.py", line 84, in <module>
    run_global_search(
  File "d:\WORK\PROJECTS\Python-POC\GraphRAG\.venv\lib\site-packages\graphrag\query\cli.py", line 71, in run_global_search
    final_nodes: pd.DataFrame = pd.read_parquet(
  File "d:\WORK\PROJECTS\Python-POC\GraphRAG\.venv\lib\site-packages\pandas\io\parquet.py", line 667, in read_parquet
    return impl.read(
  File "d:\WORK\PROJECTS\Python-POC\GraphRAG\.venv\lib\site-packages\pandas\io\parquet.py", line 267, in read
    path_or_handle, handles, filesystem = _get_path_or_handle(
  File "d:\WORK\PROJECTS\Python-POC\GraphRAG\.venv\lib\site-packages\pandas\io\parquet.py", line 140, in _get_path_or_handle
    handles = get_handle(
  File "d:\WORK\PROJECTS\Python-POC\GraphRAG\.venv\lib\site-packages\pandas\io\common.py", line 882, in get_handle
    handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'D:\WORK\PROJECTS\Python-POC\GraphRAG\ragtest\output\20240717-141348\artifacts\create_final_nodes.parquet'
PS D:\WORK\PROJECTS\Python-POC\GraphRAG>
```
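A quick sanity check that the artifacts really live under the configured folder rather than the inferred timestamp folder is to retarget the failing read at the configured path (a sketch; the absolute path mirrors the traceback above, adjust to your layout):

```python
import pandas as pd

# The same read that fails in the traceback, pointed at the configured
# storage base_dir instead of the inferred timestamp folder.
nodes = pd.read_parquet(
    r"D:\WORK\PROJECTS\Python-POC\GraphRAG\ragtest\output\files\artifacts\create_final_nodes.parquet"
)
print(nodes.shape)
```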

Additional Information

github-actions[bot] commented 1 month ago

This issue has been marked stale due to inactivity after repo maintainer or community member responses that request more information or suggest a solution. It will be closed after five additional days.