microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License
18.68k stars 1.82k forks

[Bug]: NotADirectoryError: [Errno 20] Not a directory: '/Users/username/ragtest/output/.DS_Store/artifacts/create_final_nodes.parquet' #891

Closed semenoffalex closed 2 months ago

semenoffalex commented 2 months ago

Do you need to file an issue?

Describe the bug

I managed to follow your example here and got the message "All workflows completed successfully", even though I saw the "Errors occurred during the pipeline run, see logs for more details" message a couple of times at the "create_base_entity_graph" step.

However, when I ran the first command to interact with the graph (it took my M2 ~5 hours to build it for "A Christmas Carol"), I got a Python error.

Command I'm trying:

python -m graphrag.query \
--root ./ragtest \
--method global \
"What are the top themes in this story?"

Error I get:

INFO: Reading settings from ragtest/settings.yaml
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/miniconda3/envs/graphollama/lib/python3.11/site-packages/graphrag/query/__main__.py", line 83, in <module>
    run_global_search(
  File "/opt/miniconda3/envs/graphollama/lib/python3.11/site-packages/graphrag/query/cli.py", line 67, in run_global_search
    final_nodes: pd.DataFrame = pd.read_parquet(
                                ^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/graphollama/lib/python3.11/site-packages/pandas/io/parquet.py", line 667, in read_parquet
    return impl.read(
           ^^^^^^^^^^
  File "/opt/miniconda3/envs/graphollama/lib/python3.11/site-packages/pandas/io/parquet.py", line 267, in read
    path_or_handle, handles, filesystem = _get_path_or_handle(
                                          ^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/graphollama/lib/python3.11/site-packages/pandas/io/parquet.py", line 140, in _get_path_or_handle
    handles = get_handle(
              ^^^^^^^^^^^
  File "/opt/miniconda3/envs/graphollama/lib/python3.11/site-packages/pandas/io/common.py", line 882, in get_handle
    handle = open(handle, ioargs.mode)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
NotADirectoryError: [Errno 20] Not a directory: '/Users/username/ragtest/output/.DS_Store/artifacts/create_final_nodes.parquet'

Steps to reproduce

  1. Install GraphRAG using this tutorial: https://www.fahdmirza.com/2024/07/install-microsoft-graphrag-with-ollama.html
  2. Edit openai_embeddings_llm.py to swap the OpenAI models for mistral (chat) and nomic_embed_text (embeddings)
  3. mkdir -p ./ragtest/input
  4. curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt > ./ragtest/input/book.txt
  5. python -m graphrag.index --init --root ./ragtest
  6. python -m graphrag.index --root ./ragtest
  7. python -m graphrag.query --root ./ragtest --method global "What are the top themes in this story?"

Expected Behavior

I expect to get the answer for the question: "What are the top themes in this story?"

GraphRAG Config Used


encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: mistral
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: http://localhost:11434/v1
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: nomic_embed_text
    api_base: http://localhost:11434/api
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

[screenshot]

Additional Information

natoverse commented 2 months ago

.DS_Store is a mac file, not a directory. The query library tries to find the most recent timestamped run - for some reason it is picking up this .DS_Store file instead. We'll take a look at the selection code to ensure it is looking only for folders.

As an immediate fix, you should be able to delete that file (it is just an OS cache of view settings) and re-run.

If you'd like to be more precise each time you run, you can add the --data param on the CLI and point to exactly the folder of artifacts that you want to query from (e.g., {root}/output/{timestamp}/artifacts).
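
For example, something along these lines should work from the project root (the <timestamp> folder name is a placeholder; use whichever run directory actually exists under output/):

   # remove the Finder metadata file that the run selection is picking up
   rm ./ragtest/output/.DS_Store

   # or query a specific run's artifacts directly
   python -m graphrag.query \
   --root ./ragtest \
   --method global \
   --data ./ragtest/output/<timestamp>/artifacts \
   "What are the top themes in this story?"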

semenoffalex commented 2 months ago

> As an immediate fix, you should be able to delete that file (it is just an OS cache of view settings) and re-run.

Thanks! This time it moved further, but a new error appeared:

INFO: Reading settings from ragtest/settings.yaml
creating llm client with {'api_key': 'REDACTED,len=9', 'type': "openai_chat", 'model': 'mistral', 'max_tokens': 4000, 'request_timeout': 180.0, 'api_base': 'http://localhost:11434/v1', 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}
Error parsing search response json
Traceback (most recent call last):
  File "/opt/miniconda3/envs/graphollama/lib/python3.11/site-packages/graphrag/query/structured_search/global_search/search.py", line 194, in _map_response_single_batch
    processed_response = self.parse_search_response(search_response)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/graphollama/lib/python3.11/site-packages/graphrag/query/structured_search/global_search/search.py", line 232, in parse_search_response
    parsed_elements = json.loads(search_response)["points"]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/graphollama/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/graphollama/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/graphollama/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

SUCCESS: Global Search Response: I am sorry but I am unable to answer this question given the provided data.

The command was the same:

python3 -m graphrag.query \
--root ./ragtest \
--method global \
"What are the top themes in this story?"

natoverse commented 2 months ago

We've seen a fair bit of reporting on non-OpenAI model JSON formats. 0.2.1 included some improvements to the fallback parsing when a model returns malformed JSON, but it may still have issues that we are unaware of. Unfortunately there's not a lot we can do to help diagnose these since it would be a lot of work to test all the models available. My best recommendation is to search through issues linked to #657 to see what solutions folks have found.
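
One way to sanity-check the model itself (this is just a diagnostic suggestion, not something GraphRAG does for you) is to ask Ollama's native chat endpoint for JSON output directly and see whether the reply is well-formed:

   curl http://localhost:11434/api/chat -d '{
     "model": "mistral",
     "messages": [{"role": "user", "content": "Return a JSON object with a \"points\" array listing two themes of A Christmas Carol."}],
     "format": "json",
     "stream": false
   }'

If the message content that comes back is not a single valid JSON object, the global search map step will fail to parse it regardless of how GraphRAG is configured.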

9prodhi commented 2 months ago

It seems you might be using Ollama to run the Mistral model locally. To help resolve the issue, the following setup may be helpful:

  1. Use the configuration file below (I've removed commented lines for clarity):
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat
  model: mistral
  model_supports_json: true
  api_base: http://localhost:11434/v1

parallelization:
  stagger: 120

async_mode: threaded

embeddings:
  async_mode: threaded
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding
    model: nomic-ai/nomic-embed-text-v1.5-GGUF
    api_base: http://localhost:8001/v1/
    concurrent_requests: 2

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id]

input:
  type: file
  file_type: text
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file
  base_dir: "cache"

storage:
  type: file
  base_dir: "output/${timestamp}/artifacts"

reporting:
  type: file
  base_dir: "output/${timestamp}/reports"

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false

umap:
  enabled: false

snapshots:
  graphml: true
  raw_entities: true
  top_level_nodes: false
  2. Next, you'll need to set up the embedding server for the nomic-embed-text model. Clone the following repository and use the ollama_serv.py script to serve the API for embeddings:
   git clone https://github.com/9prodhi/EmbedAdapter
   cd EmbedAdapter
   python ollama_serv.py
  3. Ensure that your Ollama instance for the Mistral & nomic-embed-text models is running and accessible at http://localhost:11434/v1.
  4. Once the embedding server for nomic-embed-text is running, it should be accessible at http://localhost:8001/v1/, which matches the api_base URL in the embeddings configuration (a quick connectivity check follows below).
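
As a rough connectivity check (this assumes the adapter exposes an OpenAI-compatible /v1/embeddings route, which is what the openai_embedding client will call):

   # Ollama should respond with the list of pulled models
   curl http://localhost:11434/api/tags

   # the embedding server should answer an OpenAI-style embeddings request
   curl http://localhost:8001/v1/embeddings \
     -H "Content-Type: application/json" \
     -d '{"model": "nomic-ai/nomic-embed-text-v1.5-GGUF", "input": "hello world"}'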

By following these steps, you should be able to resolve the JSON parsing issues and get your local setup working correctly with the Mistral model for LLM tasks and the nomic-embed-text model for embeddings.

If you continue to experience problems, please provide more details about the specific error you're encountering, and I'll be happy to assist further.

natoverse commented 2 months ago

Original bug fixed with #910, routing this conversation to #657 for Ollama tuning.

Aaron-genai-superdev commented 2 months ago

Dear all, I solved the problem! Here are my steps:

First, I got the same "NotADirectoryError: [Errno 20] Not a directory: '/Users/xxxxx/ragtest/output/.DS_Store/artifacts/create_final_nodes.parquet'" problem, too.

[screenshot]

Then, as @natoverse said, .DS_Store is just a temp file, and the system is looking for a file named create_final_nodes.parquet. To point it at that file, we can use --data.

Third, I got help from the CLI; the usage is: python -m graphrag.query [-h] [--config CONFIG] [--data DATA] [--root ROOT] --method {local,global} [--community_level COMMUNITY_LEVEL] [--response_type RESPONSE_TYPE] query

Then I used the CLI command: python -m graphrag.query --root ./ragtest --method global --data ./ragtest/output/20240820-232301/artifacts/ "What are the top themes in this story?"

It works! Note that 20240820-232301 is my run's timestamp. Just look for the artifacts directory and pass its path to --data.
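
If you would rather not look up the timestamp by hand, a small shell helper like this (assuming the default output/<timestamp>/artifacts layout from settings.yaml) picks the newest run automatically:

   # newest run directory, sorted by modification time; the trailing / skips plain files like .DS_Store
   LATEST=$(ls -td ./ragtest/output/*/ | head -1)
   python -m graphrag.query \
   --root ./ragtest \
   --method global \
   --data "${LATEST}artifacts" \
   "What are the top themes in this story?"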

[screenshot]