microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Bug]: Query client fails when using a vector store #770

Open · nievespg1 opened this issue 3 months ago

nievespg1 commented 3 months ago


Describe the bug

The bash client fails when running a local search query using a vector store.

Steps to reproduce

  1. Define a vector store in your settings.yaml, nested under embeddings as in the full config below.
    embeddings:
      <other-params>
      ...
      vector_store:
        type: lancedb
        overwrite: true
        db_uri: /path/to/vector/db
  2. Run a local query using the bash client (see the diagnostic sketch after these steps).
    python -m graphrag.query \
    --config <path/to/settings>/settings.yaml \
    --data <path/to/index/dir>/storage \
    --community_level 2 \
    --method local "Local Query"
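
Before running the query, it can help to confirm what the indexer actually wrote to the store. A minimal diagnostic sketch, assuming the paths and collection name from the config below, and assuming graphrag persists its embeddings to a fixed-size list column in LanceDB:

    # Hypothetical diagnostic, not part of graphrag: open the same LanceDB
    # store the query client uses and inspect its schema.
    import lancedb

    db = lancedb.connect("<path/to/base/dir>/index/storage/lancedb")
    tbl = db.open_table("entity_description_embeddings")

    # A searchable table should contain a fixed-size list<float> column;
    # if none exists, vector search has nothing to run against.
    print(tbl.schema)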

Expected Behavior

Expect to see a ValueError notifying you that there is no vector column in the data. See the screenshot in the Logs and screenshots section for the full stack trace.
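
For context, a ValueError like this can be reproduced directly against LanceDB whenever a table carries no vector column; a standalone sketch, independent of graphrag (table name and data are illustrative, and it assumes the error originates in LanceDB's vector-column inference):

    import lancedb

    db = lancedb.connect("/tmp/lancedb-repro")

    # Build a table whose schema has no fixed-size-list (vector) column.
    tbl = db.create_table(
        "entity_description_embeddings",
        data=[{"id": "1", "description": "no embedding stored"}],
        mode="overwrite",
    )

    # search() must infer which column holds the vectors; with none present,
    # LanceDB is expected to raise a ValueError about the missing vector
    # column. 1536 is the text-embedding-ada-002 dimension from the config.
    tbl.search([0.0] * 1536).limit(5).to_list()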

GraphRAG Config Used

# Define anchors to be reused
openai_api_key_smt_octo: &openai_api_key_smt_octo ${OPENAI_API_KEY_SMT_OCTO}

#######################
# pipeline parameters # 
#######################

# data inputs
input:
  type: file
  file_type: text
  file_pattern: .*\.txt$
  base_dir: <path/to/base/dir>/data-01p

# tokenizer model name
encoding_model: &encoding_name o200k_base # gpt-4o
# encoding_model: &encoding_name cl100k_base # gpt-4-turbo

# text chunking
chunks:
  size: &chunk_size 700 # 700 tokens (about 2800 characters)
  overlap: &chunk_overlap 100 # 100 tokens (about 400 characters)
  strategy:
      type: tokens
      chunk_size: *chunk_size
      chunk_overlap: *chunk_overlap
      encoding_name: *encoding_name

# chat llm inputs
llm: &chat_llm
  api_key: *openai_api_key_smt_octo
  type: openai_chat
  model: gpt-4o-mini
  max_tokens: 4096
  request_timeout: 180 # 3 minutes should make sure we can handle busy AOAI instances
  api_version: "2024-02-01"
  # deployment_name: gpt-4o-mini
  model_supports_json: true
  tokens_per_minute: 1000000
  requests_per_minute: 10000
  max_retries: 20
  max_retry_wait: 10
  sleep_on_rate_limit_recommendation: true
  concurrent_requests: 50

parallelization: &parallelization
  stagger: 0.25
  num_threads: 100

async_mode: &async_mode asyncio
# async_mode: &async_mode threaded

entity_extraction:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
  prompt: <path/to/base/dir>/prompts/entity_extraction.txt
  max_gleanings: 1

summarize_descriptions:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
  prompt: <path/to/base/dir>/prompts/summarize_descriptions.txt
  max_length: 500

community_reports:
  llm: *chat_llm
  parallelization: *parallelization
  async_mode: *async_mode
  prompt: <path/to/base/dir>/prompts/community_report.txt
  max_length: &max_report_length 2000
  max_input_length: 8000

# embeddings llm inputs
embeddings:
  llm:
    api_key: *openai_api_key_smt_octo
    type: openai_embedding
    model: text-embedding-ada-002
    request_timeout: 180 # 3 minutes should make sure we can handle busy AOAI instances
    api_version: "2024-02-01"
    # deployment_name: text-embedding-ada-002
    model_supports_json: false
    tokens_per_minute: 10000000
    requests_per_minute: 10000
    max_retries: 20
    max_retry_wait: 10
    sleep_on_rate_limit_recommendation: true
    concurrent_requests: 50
  parallelization: *parallelization
  async_mode: *async_mode
  batch_size: 16
  batch_max_tokens: 8191
  vector_store:
    type: lancedb
    overwrite: true
    db_uri: <path/to/base/dir>/index/storage/lancedb
    query_collection_name: entity_description_embeddings

cache:
  type: file
  base_dir: <path/to/base/dir>/index/cache

storage:
  type: file
  base_dir: <path/to/base/dir>/index/storage

reporting:
  type: file
  base_dir: <path/to/base/dir>/index/reporting

snapshots:
  graphml: true
  raw_entities: true
  top_level_nodes: true

#####################################
# orchestration (query) definitions # 
#####################################
local_search:
  text_unit_prop: 0.5
  community_prop: 0.1
  conversation_history_max_turns: 5
  top_k_entities: 10
  top_k_relationships: 10
  temperature: 0.0
  top_p: 1.0
  n: 1
  max_tokens: 12000
  llm_max_tokens: 2000

global_search:
  temperature: 0.0
  top_p: 1.0
  n: 1
  max_tokens: 12000
  data_max_tokens: 12000
  map_max_tokens: 1000
  reduce_max_tokens: 2000
  concurrency: 50
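
Aside on the config style: the &name / *name pairs above are standard YAML anchors and aliases, so the single chat llm block is reused verbatim by every pipeline stage. A minimal sketch of how a loader resolves them (values are illustrative):

    import yaml

    doc = """
    llm: &chat_llm
      model: gpt-4o-mini
      max_tokens: 4096
    entity_extraction:
      llm: *chat_llm
    """

    cfg = yaml.safe_load(doc)
    # The alias expands to the anchored mapping at load time.
    assert cfg["entity_extraction"]["llm"]["model"] == "gpt-4o-mini"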

Logs and screenshots

[screenshot: full stack trace ending in ValueError: there is no vector column in the data]

zaraken commented 3 months ago

The expected behavior should describe what would happen if the bug did not exist. The ValueError is the actual behavior, i.e. what erroneously happens, correct?

Harbon commented 3 months ago

I still have the same issue after #771 was merged.

whale567 commented 1 month ago

Same issue here. Did you solve it?

msma commented 1 week ago

Same issue here. Did anyone solve it?