microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

Issue: Errors Occurring During Pipeline Run #782

Closed: davidgross631 closed this issue 1 month ago

davidgross631 commented 1 month ago


Describe the issue

I'm following this setup guide:

Get Started (microsoft.github.io)

Everything was going fine until I ran the pipeline with this command:

python -m graphrag.index --root ./ragtest

This is the error:

[screenshot of the error]

I am using OpenAI, and I have the proper API key in the .env file. However, I did change the settings.yaml file to use gpt-4o-mini as the model, because that is what my API key supports.
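
For reference, the .env file in the ragtest root only needs the key itself, which settings.yaml reads via ${GRAPHRAG_API_KEY}; the value below is a placeholder:

GRAPHRAG_API_KEY=<your_openai_api_key>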

I have tried looking at the log files, but they were not much help either. I am just wondering what could be causing the issue.

Steps to reproduce

You can replicate the issue by following the exact same steps here, using OpenAI (not Azure OpenAI):

https://microsoft.github.io/graphrag/posts/get_started/

GraphRAG Config Used


encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

Log folder: [screenshot]

Main.log: [screenshot]

Network.log in window 2: [screenshot]


luneice commented 1 month ago

Attach some logs; they can be found in the ragtest/output/202407xxxx/ folder.

davidgross631 commented 1 month ago

Attached: logs.json, indexing-engine.log, stats.json

davidgross631 commented 1 month ago

Not sure how helpful this is, but thank you regardless!

luneice commented 1 month ago
15:33:28,850 graphrag.index.input.text INFO found text files from input, found [('book.txt', {})]
15:33:28,855 graphrag.index.input.text WARNING Warning! Error loading file book.txt. Skipping...
15:33:28,855 graphrag.index.input.text INFO Found 1 files, loading 0
15:33:28,855 graphrag.index.workflows.load INFO Workflow Run Order: ['create_base_text_units', 'create_base_extracted_entities', 'create_summarized_entities', 'create_base_entity_graph', 'create_final_entities', 'create_final_nodes', 'create_final_communities', 'join_text_units_to_entity_ids', 'create_final_relationships', 'join_text_units_to_relationship_ids', 'create_final_community_reports', 'create_final_text_units', 'create_base_documents', 'create_final_documents']

The create_base_text_units workflow got an error caused by 'id'. First, check the book.txt content, and then check your prompt.
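
As a quick sanity check, here is a minimal sketch to confirm whether book.txt even decodes as UTF-8, since the loader warning above shows the file being read and then skipped; the path assumes the default layout from the Get Started guide:

from pathlib import Path

path = Path("ragtest/input/book.txt")
try:
    text = path.read_text(encoding="utf-8")
    print(f"loaded {len(text):,} characters OK")
except UnicodeDecodeError as err:
    print(f"book.txt is not valid UTF-8: {err}")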

davidgross631 commented 1 month ago

Attached: book.txt. This is the book.txt content; not sure why there would be an issue, because I followed this command from the guide online: curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt > ./ragtest/input/book.txt. Maybe I should just put an excerpt from the book in book.txt and move on from there?

davidgross631 commented 1 month ago

Now, I am getting this error in the log: logs.json

natoverse commented 1 month ago

@davidgross631 can you try setting encoding_model: o200k_base in your settings.yaml? With gpt-4o, OpenAI changed their encoding model (see the tiktoken mapping). We can update the docs to help clarify this.
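
The suggested change is the first line of settings.yaml, replacing the cl100k_base value shown in the config above:

encoding_model: o200k_base # gpt-4o-family models use o200k_base; older GPT-4 models used cl100k_base

This mapping comes from tiktoken: in current releases, tiktoken.encoding_for_model("gpt-4o") returns the o200k_base encoding.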

davidgross631 commented 1 month ago

Hm, still having issues unfortunately. Very similar to Error #779.

I do not have/see an indexing-engine.log, but I did uncomment the api_base line in settings.yaml and set it to https://api.openai.com/v1/.
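
For reference, the relevant llm block would then look roughly like this (a sketch of the change described above, not a confirmed fix):

llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat
  model: gpt-4o-mini
  api_base: https://api.openai.com/v1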

Still did not work, sadly; I am still getting the same message. I also tried adjusting max_tokens, as referred to in a different issue, but still no luck.

[screenshot of the error]

natoverse commented 1 month ago

Ok, thanks for trying that. So your issue may be happening before the encoding_model setting is even a potential problem. Are you directly on Windows, or using WSL? I am hearing there can be UTF-8 encoding issues on default Windows, and folks are having luck running in WSL if that's an option to try.

Maybe I should just put an excerpt from the book in book.txt and move on from there?

Certainly worth a try in case curl didn't download it correctly
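
One way to rule out a bad download is to re-fetch the file from Python and force UTF-8 on disk; a minimal stdlib-only sketch, assuming the Project Gutenberg file is served as UTF-8:

import urllib.request

url = "https://www.gutenberg.org/cache/epub/24022/pg24022.txt"
raw = urllib.request.urlopen(url).read()
text = raw.decode("utf-8-sig")  # utf-8-sig strips a BOM if one is present
with open("ragtest/input/book.txt", "w", encoding="utf-8", newline="\n") as f:
    f.write(text)
print(f"wrote {len(text):,} characters")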

davidgross631 commented 1 month ago

Windows. I've already set the text to UTF-8 encoding using Notepad++.

davidgross631 commented 1 month ago

I have also used WSL/Ubuntu and Windows with no luck for either

luneice commented 1 month ago

I have also used WSL/Ubuntu and Windows with no luck for either

After deleting the cache and output folders, I got the same error on every try.

qizhanghw commented 1 month ago

Hm, still having issues unfortunately. Very similar to Error #779. [...]

I have the same problem. Have you solved it?

davidgross631 commented 1 month ago

I have the same problem. Have you solved it?

I have; it was actually an issue with the OpenAI API key. Make sure that you are not using a free-tier key, or at least double-check that there aren't any billing/rate-limit issues. That was the case with me, and I didn't realize it until trying it with WSL.
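
A quick way to check this independently of graphrag is a one-off chat completion with the same key; a minimal sketch, assuming the openai v1.x Python package and GRAPHRAG_API_KEY exported in the environment:

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["GRAPHRAG_API_KEY"])  # same key settings.yaml uses
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5,
)
print(resp.choices[0].message.content)  # any reply means auth and billing are fine

If this call fails with a 401 or 429, the indexing errors are coming from the key, not from graphrag.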

KumarAditya98 commented 19 hours ago

I'm still facing this exact same issue, but I am using Azure OpenAI instead of OpenAI. Has anyone using Azure OpenAI solved this? I performed all the recommended steps, but no luck.