microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Bug]: `KeyError: 'reports'` when doing Local Search #934

Open l4b4r4b4b4 opened 1 month ago

l4b4r4b4b4 commented 1 month ago

Describe the bug

I am indexing and querying with the graphrag library against an OpenAI-compatible vLLM server, with tools, functions, and embeddings.

Everything works as expected when running the local and global search notebooks against the constructed index.

However, when trying to get the reports in the local search notebook, I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[17], line 1
----> 1 result.context_data["reports"].head()

KeyError: 'reports'
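
A quick way to see which context tables a search actually returned is to list the keys before indexing into them (a diagnostic sketch; result is the SearchResult from the cell above, and context_data is assumed to be a dict of DataFrames as in the example notebooks):

# Diagnostic sketch: list the context tables this search returned,
# instead of assuming 'reports' is among them.
print(list(result.context_data.keys()))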

Steps to reproduce

No response

Expected Behavior

To get a report similar to the one I get in the global search notebook.

GraphRAG Config Used

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: text-assistant
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

No response

Additional Information

xgl0626 commented 1 month ago

If you look at the log files in the output folder, you will find that there is an error when the community reports are generated at the end; I encountered this problem too.

l4b4r4b4b4 commented 1 month ago

Nope, no error logs for anything.

xgl0626 commented 1 month ago

You can check this file: graphrag-0.2.2/output/20240812-090029/reports/indexing-engine.log

l4b4r4b4b4 commented 1 month ago

I'm using 0.3.0, but yes: no errors!

wanchichang commented 1 month ago

@l4b4r4b4b4 This is because result.context_data from a local search only contains the entities, relationships, and sources fields, while a global search also includes result.context_data['reports'].
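
Given that, a guarded lookup avoids the KeyError regardless of which search type produced the result (a sketch following the notebooks' variable naming):

# Guarded access: inspect the 'reports' table only when it is present.
reports = result.context_data.get("reports")
if reports is not None:
    print(reports.head())
else:
    print("No 'reports' table in this search's context_data.")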

l4b4r4b4b4 commented 1 month ago

OK, so it's simply a mistake that reports are accessed in the local search Jupyter notebook?

wanchichang commented 1 month ago

Yes, you just need to comment out that line, and local search will run normally.
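
Concretely, in the local search notebook that means changing the failing cell to something like this (a sketch; the available tables follow the fields listed above):

# result.context_data["reports"].head()  # commented out: local search may not return this table
result.context_data["entities"].head()   # entities, relationships, and sources are returned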

natoverse commented 3 weeks ago

You can comment out the community reports to get unblocked, but we do expect them for local search results. Some things to check:

  • Does the notebook run fine with the example parquets?
  • Does your create_final_community_reports.parquet look reasonable compared to the example? (see the sketch at the end of this thread)
  • Does the CLI query work when pointed at your index, or is this issue clearly isolated to the example notebook?

I just ran the local search example notebook again with the example data and it worked as expected.
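
For the second check above, a minimal inspection of the community reports artifact might look like this (a sketch; the path is an assumption based on the storage config earlier in the thread, with the run timestamp filled in):

import pandas as pd

# Sketch: sanity-check the community reports table produced by indexing.
# The path below is an assumption from the storage config; substitute your run's timestamp.
reports_df = pd.read_parquet("output/<timestamp>/artifacts/create_final_community_reports.parquet")
print(reports_df.shape)             # an empty table here would explain the missing 'reports' key
print(reports_df.columns.tolist())
print(reports_df.head())

If this table is empty or missing, the community report workflow likely failed during indexing, which matches the earlier suggestion to check indexing-engine.log.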