microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/

[Bug]: empty workflows list, no indexing is done #489

Open vv111y opened 2 months ago

vv111y commented 2 months ago

Describe the bug

Running in Google Colab. I used several different settings.yaml files to try to get it to work, including the initial stock file with a .env file. One time, starting from scratch in a new folder, it partly worked (it errored out before all workflow tasks were done), but after that the problem persists. I can see no pattern for the cause. Please see indexing-engine.log.

Steps to reproduce

  1. Use Google Colab to run.
  2. pip install graphrag.
    Note the resulting pip error (see the check after this list):
    ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
    cudf-cu12 24.4.1 requires pandas<2.2.2dev0,>=2.0, but you have pandas 2.2.2 which is incompatible.
    cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 15.0.0 which is incompatible.
    google-colab 1.0.0 requires pandas==2.0.3, but you have pandas 2.2.2 which is incompatible.
  3. Run indexing using several different settings.yaml files, with combinations of using .env or entering config directly in the settings file, including the stock settings.yaml.
  4. Result: empty artifacts folder, no workflow tasks done.
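
These conflicts likely come from Colab's pre-installed cudf-cu12 and google-colab packages rather than from graphrag itself (graphrag does not use cudf), so they are probably unrelated to the empty-workflow behaviour. A quick check of which versions actually resolved after the install (the grep pattern is just illustrative):

pip show pandas pyarrow graphrag | grep -E "^(Name|Version)"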

Expected Behavior

The workflow list should be fully populated and all tasks should run correctly. At best I have had only a few partial runs; now nothing runs at all.

GraphRAG Config Used

encoding_model: cl100k_base
# encoding_model: ${GRAPHRAG_ENCODING_MODEL} 
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  # model: gpt-4-turbo-preview
  model: ${GRAPHRAG_MODEL} 
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: ${GRAPHRAG_API_BASE} 
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  concurrent_requests: 5 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    # model: text-embedding-3-small
    model: ${GRAPHRAG_EMBEDDING_MODEL}
    # api_base: ${GRAPHRAG_API_BASE} 
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

indexing-engine.log

14:57:04,749 graphrag.index.run INFO Running pipeline with config settings.yaml
14:57:04,751 graphrag.config.read_dotenv INFO Loading pipeline .env file
14:57:05,473 graphrag.index.storage.file_pipeline_storage INFO Creating file storage at output/20240710-145704/artifacts
14:57:05,482 graphrag.index.input.load_input INFO loading input from root_dir=input
14:57:05,482 graphrag.index.input.load_input INFO using file storage for input
14:57:05,486 graphrag.index.storage.file_pipeline_storage INFO search /content/drive/MyDrive/.../2024-07-10/input for files matching .*\.txt$
14:57:05,488 graphrag.index.input.text INFO found text files from input, found [('wildfly_jira_compact_3.txt', {}), ('wildfly_jira_compact_2.txt', {}), ('wildfly_jira_compact_1.txt', {})]
14:57:05,504 graphrag.index.workflows.load INFO Workflow Run Order: []
14:57:05,505 graphrag.index.run INFO Final # of rows loaded: 3
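
(The telltale line here is "Workflow Run Order: []": the pipeline finds and loads the three input files, but derives no workflows to run, which matches the empty artifacts folder.)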

Additional Information

vv111y commented 2 months ago

I removed the cache, double-checked .env, and tried the following minimal settings.yaml, and still got the same error.

llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: meta-llama/Llama-3-8b-chat-hf
  api_base: https://api.together.xyz/v1

embeddings:
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: togethercomputer/m2-bert-80M-2k-retrieval
    api_base: https://api.together.xyz/v1

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id]

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 0

jgbradley1 commented 2 months ago

Are you trying to run indexing using the command line interface, i.e. python -m graphrag.index ...?

I added a change last week that should address your problem. A new command line flag, --overlay-defaults, will be available; it inherits default values (i.e. the workflow steps that are missing from your yaml) in addition to the custom values your config declares.

You can either build the python package from source (run poetry build from the root directory of this repo and re-install the wheel) or wait until the next release to start using this new feature.
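
Assuming the flag behaves as described, the indexing invocation would then look something like this (the settings filename is illustrative):

python -m graphrag.index --root . --config settings.yaml --overlay-defaults

Conceptually, overlaying defaults is a recursive dictionary merge of the user's partial config on top of the complete default config. A minimal sketch of that idea, not graphrag's actual implementation:

def overlay(defaults: dict, user: dict) -> dict:
    """Recursively merge a partial user config over a complete default config."""
    merged = dict(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # both sides are mappings, so merge one level deeper
            merged[key] = overlay(merged[key], value)
        else:
            # the user's value (scalar or list) wins outright
            merged[key] = value
    return merged

Without such a merge, any section missing from settings.yaml stays empty, which is consistent with the "Workflow Run Order: []" log line above.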

vv111y commented 2 months ago

Right, I should have specified that.

python -m graphrag.index --config <some-settings.yaml> --root .

To be clear, I tried multiple settings.yaml files, including ones that specified execution of all work units. All resulted in no workflow steps. I'm installing from the main branch and can use --overlay-defaults.

pip install git+https://github.com/microsoft/graphrag@main 

But settings are still being ignored. --overlay-defaults seems to act as a band-aid for some settings. For example, when I add

embed_graph:
  enabled: true # if true, will generate node2vec embeddings for nodes
  num_walks: 10
  walk_length: 40
  window_size: 2
  iterations: 3
  random_seed: 597832

umap:
  enabled: true # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: true
  raw_entities: true
  top_level_nodes: true

embed_graph, graphml, raw_entities, umap, and top_level_nodes are not being generated.

vv111y commented 2 months ago

Additionally, when I try a local search there seems to be a missing lancedb dataset; see the first line below. As for the last line, I wonder if that is an issue with trying to run in Colab, and maybe a separate issue.

[2024-07-12T14:44:16Z WARN  lance::dataset] No existing dataset at /content/drive/MyDrive/OrangePro/runs/2024-07-10/lancedb/description_embedding.lance, it will be created
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/graphrag/query/__main__.py", line 76, in <module>
    run_local_search(
  File "/usr/local/lib/python3.10/dist-packages/graphrag/query/cli.py", line 132, in run_local_search
    store_entity_semantic_embeddings(
  File "/usr/local/lib/python3.10/dist-packages/graphrag/query/input/loaders/dfs.py", line 91, in store_entity_semantic_embeddings
    vectorstore.load_documents(documents=documents)
  File "/usr/local/lib/python3.10/dist-packages/graphrag/vector_stores/lancedb.py", line 55, in load_documents
    self.document_collection = self.db_connection.create_table(
  File "/usr/local/lib/python3.10/dist-packages/lancedb/db.py", line 418, in create_table
    tbl = LanceTable.create(
  File "/usr/local/lib/python3.10/dist-packages/lancedb/table.py", line 1545, in create
    lance.write_dataset(empty, tbl._dataset_uri, schema=schema, mode=mode)
  File "/usr/local/lib/python3.10/dist-packages/lance/dataset.py", line 2506, in write_dataset
    inner_ds = _write_dataset(reader, uri, params)
OSError: LanceError(IO): Generic LocalFileSystem error: Unable to copy file from /content/drive/MyDrive/OrangePro/runs/2024-07-10/lancedb/description_embedding.lance/_versions/.tmp_1.manifest_add4893a-5209-4899-81ae-c25465719626 to /content/drive/MyDrive/OrangePro/runs/2024-07-10/lancedb/description_embedding.lance/_versions/1.manifest: Function not implemented (os error 38), /home/runner/work/lance/lance/rust/lance-table/src/io/commit.rs:692:54
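
Error 38 on Linux is ENOSYS ("Function not implemented"), and the failing step is a file copy inside the Google Drive mount, so a plausible reading is that Colab's Drive FUSE filesystem does not support the operation lance uses to commit its manifest. If so, one workaround sketch is to run against local disk and copy results back to Drive afterward (paths taken from the traceback; the graphrag.query flags are the documented --root/--method form, but verify against your installed version):

cp -r /content/drive/MyDrive/OrangePro/runs/2024-07-10 /content/run
python -m graphrag.query --root /content/run --method local "your question"
cp -r /content/run/lancedb /content/drive/MyDrive/OrangePro/runs/2024-07-10/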

vv111y commented 2 months ago

@jgbradley1 the issue is still not fixed, only partially: several artifacts are still not produced, and the settings file is being at least partly ignored. Could there be some issue with, say, whitespace making the YAML malformed? Just guessing now.

As posted above, embed_graph and umap are enabled and all three snapshots are set to true, yet embed_graph, graphml, raw_entities, umap, and top_level_nodes are still not being generated.

github-actions[bot] commented 2 months ago

This issue has been marked stale due to inactivity after repo maintainer or community member responses that request more information or suggest a solution. It will be closed after five additional days.

ZhengRui commented 1 month ago

Having a similar issue. The default settings.yaml works fine. I tried prompt tuning and put the tuned prompts inside a prompts_tuned folder, copied settings.yaml to settings_prompts_tuned.yaml, and updated all the prompt, cache, and output paths. When I index there are two issues: 1. empty workflow list; 2. indexing-engine.log is still generated inside the output folder instead of the output_prompts_tuned folder, while logs.json is generated inside output_prompts_tuned.
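
A sketch of the kind of path edits described, mirroring keys from the full config above (the prompts_tuned and output_prompts_tuned names are as stated; the cache name is illustrative):

entity_extraction:
  prompt: "prompts_tuned/entity_extraction.txt"

summarize_descriptions:
  prompt: "prompts_tuned/summarize_descriptions.txt"

community_report:
  prompt: "prompts_tuned/community_report.txt"

cache:
  base_dir: "cache_prompts_tuned"

storage:
  base_dir: "output_prompts_tuned/${timestamp}/artifacts"

reporting:
  base_dir: "output_prompts_tuned/${timestamp}/reports"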

ZhengRui commented 1 month ago

After a bit of debugging, I found that --config and --overlay-defaults have to be used together; using only --config causes the empty-workflow issue. Also, the indexing-engine.log path is hard-coded to the output folder in the _enable_logging() function. My experiment is based on commit c749fe2.
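
For illustration, a rough reconstruction of the kind of hard-coding described; this is not the actual graphrag source, and everything beyond the _enable_logging name is an assumption:

from pathlib import Path
import logging

def _enable_logging(root_dir: str, run_id: str) -> None:
    # "output" is fixed here, so the log file ignores the configured
    # reporting base_dir (e.g. output_prompts_tuned)
    log_file = Path(root_dir) / "output" / run_id / "reports" / "indexing-engine.log"
    log_file.parent.mkdir(parents=True, exist_ok=True)
    logging.basicConfig(
        filename=str(log_file),
        filemode="a",
        level=logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )

A fix along these lines would derive the path from the reporting config instead, which would also explain why logs.json (written through the configured reporter) lands in the right folder while indexing-engine.log does not.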