microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License
17.1k stars 1.61k forks

[Bug]: Pipeline Fails with --emit json Option: Unable to Find create_base_text_units.parquet Despite create_base_text_units.json Existing #871

Open 6ixGODD opened 1 month ago

6ixGODD commented 1 month ago

Do you need to file an issue?

Describe the bug

When indexing with the command poetry run poe index --verbose --emit json (i.e. setting the emit format to JSON), the pipeline fails right after the create_base_text_units workflow completes. The error message says it cannot find create_base_text_units.parquet, even though create_base_text_units.json exists in the output directory.

Steps to reproduce

Expected Behavior

The pipeline should recognize and use the create_base_text_units.json file in the output directory when --emit json is specified.
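
For what it's worth, the traceback in the logs below suggests this comes down to a simple filename mismatch: the JSON emitter writes create_base_text_units.json, but the dependency-injection step always asks storage for the .parquet name. A minimal illustration (hypothetical variable names, not the actual graphrag source):

    # Hypothetical sketch of the suspected mismatch; not graphrag code.
    emit_type = "json"
    workflow_id = "create_base_text_units"

    emitted_name = f"{workflow_id}.{emit_type}"  # what the emitter writes
    requested_name = f"{workflow_id}.parquet"    # what the next workflow asks storage for
    assert emitted_name != requested_name        # hence "Could not find ... in storage!"

If the --emit option accepts a comma-separated list, emitting both formats (for example --emit parquet,json) should sidestep the failure, since the parquet files the loader expects would still be produced.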

GraphRAG Config Used

# Paste your config here

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini  # TODO: Change
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  tokens_per_minute: 150000 # set a leaky bucket throttle TODO: Change
  requests_per_minute: 10000 # set a leaky bucket throttle TODO: Change
  max_retries: 10 # TODO: Change
  max_retry_wait: 10.0 # TODO: Change
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 600 # TODO: Change
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

Console: (screenshot attached)

Logs: indexing-engine.log

20:15:01,324 asyncio DEBUG Using proactor: IocpProactor
20:15:01,336 graphrag.config.read_dotenv INFO Loading pipeline .env file
20:15:01,340 graphrag.index.cli INFO using default configuration: {
    "llm": {
        "api_key": "REDACTED, length 56",
        "type": "openai_chat",
        "model": "gpt-4o-mini",
        "max_tokens": 4000,
        "temperature": 0.0,
        "top_p": 1.0,
        "n": 1,
        "request_timeout": 180.0,
        "api_base": null,
        "api_version": null,
        "proxy": null,
        "cognitive_services_endpoint": null,
        "deployment_name": null,
        "model_supports_json": true,
        "tokens_per_minute": 150000,
        "requests_per_minute": 10000,
        "max_retries": 10,
        "max_retry_wait": 10.0,
        "sleep_on_rate_limit_recommendation": true,
        "concurrent_requests": 25
    },
    "parallelization": {
        "stagger": 0.3,
        "num_threads": 50
    },
    "async_mode": "threaded",
    "root_dir": ".",
    "reporting": {
        "type": "file",
        "base_dir": "output/${timestamp}/reports",
        "storage_account_blob_url": null
    },
    "storage": {
        "type": "file",
        "base_dir": "output/${timestamp}/artifacts",
        "storage_account_blob_url": null
    },
    "cache": {
        "type": "file",
        "base_dir": "cache",
        "storage_account_blob_url": null
    },
    "input": {
        "type": "file",
        "file_type": "text",
        "base_dir": "input",
        "storage_account_blob_url": null,
        "encoding": "utf-8",
        "file_pattern": ".*\\.txt$",
        "file_filter": null,
        "source_column": null,
        "timestamp_column": null,
        "timestamp_format": null,
        "text_column": "text",
        "title_column": null,
        "document_attribute_columns": []
    },
    "embed_graph": {
        "enabled": false,
        "num_walks": 10,
        "walk_length": 40,
        "window_size": 2,
        "iterations": 3,
        "random_seed": 597832,
        "strategy": null
    },
    "embeddings": {
        "llm": {
            "api_key": "REDACTED, length 56",
            "type": "openai_embedding",
            "model": "text-embedding-3-small",
            "max_tokens": 4000,
            "temperature": 0,
            "top_p": 1,
            "n": 1,
            "request_timeout": 180.0,
            "api_base": null,
            "api_version": null,
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": null,
            "tokens_per_minute": 0,
            "requests_per_minute": 0,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "batch_size": 16,
        "batch_max_tokens": 8191,
        "target": "required",
        "skip": [],
        "vector_store": null,
        "strategy": null
    },
    "chunks": {
        "size": 600,
        "overlap": 100,
        "group_by_columns": [
            "id"
        ],
        "strategy": null,
        "encoding_model": null
    },
    "snapshots": {
        "graphml": false,
        "raw_entities": false,
        "top_level_nodes": false
    },
    "entity_extraction": {
        "llm": {
            "api_key": "REDACTED, length 56",
            "type": "openai_chat",
            "model": "gpt-4o-mini",
            "max_tokens": 4000,
            "temperature": 0.0,
            "top_p": 1.0,
            "n": 1,
            "request_timeout": 180.0,
            "api_base": null,
            "api_version": null,
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": true,
            "tokens_per_minute": 150000,
            "requests_per_minute": 10000,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "prompt": "prompts/entity_extraction.txt",
        "entity_types": [
            "organization",
            "person",
            "geo",
            "event"
        ],
        "max_gleanings": 1,
        "strategy": null,
        "encoding_model": null
    },
    "summarize_descriptions": {
        "llm": {
            "api_key": "REDACTED, length 56",
            "type": "openai_chat",
            "model": "gpt-4o-mini",
            "max_tokens": 4000,
            "temperature": 0.0,
            "top_p": 1.0,
            "n": 1,
            "request_timeout": 180.0,
            "api_base": null,
            "api_version": null,
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": true,
            "tokens_per_minute": 150000,
            "requests_per_minute": 10000,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "prompt": "prompts/summarize_descriptions.txt",
        "max_length": 500,
        "strategy": null
    },
    "community_reports": {
        "llm": {
            "api_key": "REDACTED, length 56",
            "type": "openai_chat",
            "model": "gpt-4o-mini",
            "max_tokens": 4000,
            "temperature": 0.0,
            "top_p": 1.0,
            "n": 1,
            "request_timeout": 180.0,
            "api_base": null,
            "api_version": null,
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": true,
            "tokens_per_minute": 150000,
            "requests_per_minute": 10000,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "prompt": "prompts/community_report.txt",
        "max_length": 2000,
        "max_input_length": 8000,
        "strategy": null
    },
    "claim_extraction": {
        "llm": {
            "api_key": "REDACTED, length 56",
            "type": "openai_chat",
            "model": "gpt-4o-mini",
            "max_tokens": 4000,
            "temperature": 0.0,
            "top_p": 1.0,
            "n": 1,
            "request_timeout": 180.0,
            "api_base": null,
            "api_version": null,
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": true,
            "tokens_per_minute": 150000,
            "requests_per_minute": 10000,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "enabled": false,
        "prompt": "prompts/claim_extraction.txt",
        "description": "Any claims or facts that could be relevant to information discovery.",
        "max_gleanings": 1,
        "strategy": null,
        "encoding_model": null
    },
    "cluster_graph": {
        "max_cluster_size": 10,
        "strategy": null
    },
    "umap": {
        "enabled": false
    },
    "local_search": {
        "text_unit_prop": 0.5,
        "community_prop": 0.1,
        "conversation_history_max_turns": 5,
        "top_k_entities": 10,
        "top_k_relationships": 10,
        "temperature": 0.0,
        "top_p": 1.0,
        "n": 1,
        "max_tokens": 12000,
        "llm_max_tokens": 2000
    },
    "global_search": {
        "temperature": 0.0,
        "top_p": 1.0,
        "n": 1,
        "max_tokens": 12000,
        "data_max_tokens": 12000,
        "map_max_tokens": 1000,
        "reduce_max_tokens": 2000,
        "concurrency": 32
    },
    "encoding_model": "cl100k_base",
    "skip_workflows": []
}
20:15:01,364 graphrag.index.create_pipeline_config INFO Using LLM Config {
    "api_key": "*****",
    "type": "openai_chat",
    "model": "gpt-4o-mini",
    "max_tokens": 4000,
    "temperature": 0.0,
    "top_p": 1.0,
    "n": 1,
    "request_timeout": 180.0,
    "api_base": null,
    "api_version": null,
    "organization": null,
    "proxy": null,
    "cognitive_services_endpoint": null,
    "deployment_name": null,
    "model_supports_json": true,
    "tokens_per_minute": 150000,
    "requests_per_minute": 10000,
    "max_retries": 10,
    "max_retry_wait": 10.0,
    "sleep_on_rate_limit_recommendation": true,
    "concurrent_requests": 25
}
20:15:01,364 graphrag.index.create_pipeline_config INFO Using Embeddings Config {
    "api_key": "*****",
    "type": "openai_embedding",
    "model": "text-embedding-3-small",
    "max_tokens": 4000,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "request_timeout": 180.0,
    "api_base": null,
    "api_version": null,
    "organization": null,
    "proxy": null,
    "cognitive_services_endpoint": null,
    "deployment_name": null,
    "model_supports_json": null,
    "tokens_per_minute": 0,
    "requests_per_minute": 0,
    "max_retries": 10,
    "max_retry_wait": 10.0,
    "sleep_on_rate_limit_recommendation": true,
    "concurrent_requests": 25
}
20:15:01,366 graphrag.index.create_pipeline_config INFO skipping workflows 
20:15:01,457 graphrag.index.run INFO Running pipeline
20:15:01,457 graphrag.index.storage.file_pipeline_storage INFO Creating file storage at output\20240808-201501\artifacts
20:15:01,458 graphrag.index.input.load_input INFO loading input from root_dir=input
20:15:01,459 graphrag.index.input.load_input INFO using file storage for input
20:15:01,460 graphrag.index.storage.file_pipeline_storage INFO search input for files matching .*\.txt$
20:15:01,461 graphrag.index.input.text INFO found text files from input, found [('��ҽ���ѧ���ﲿ����������.txt', {})]
20:15:01,464 graphrag.index.input.text INFO Found 1 files, loading 1
20:15:01,466 graphrag.index.workflows.load INFO Workflow Run Order: ['create_base_text_units', 'create_base_extracted_entities', 'create_summarized_entities', 'create_base_entity_graph', 'create_final_entities', 'create_final_nodes', 'create_final_communities', 'join_text_units_to_entity_ids', 'create_final_relationships', 'join_text_units_to_relationship_ids', 'create_final_community_reports', 'create_final_text_units', 'create_base_documents', 'create_final_documents']
20:15:01,466 graphrag.index.run INFO Final # of rows loaded: 1
20:15:01,618 graphrag.index.run INFO Running workflow: create_base_text_units...
20:15:01,618 graphrag.index.run INFO dependencies for create_base_text_units: []
20:15:01,622 datashaper.workflow.workflow INFO executing verb orderby
20:15:01,627 datashaper.workflow.workflow INFO executing verb zip
20:15:01,631 datashaper.workflow.workflow INFO executing verb aggregate_override
20:15:01,642 datashaper.workflow.workflow INFO executing verb chunk
20:15:01,824 datashaper.workflow.workflow INFO executing verb select
20:15:01,830 datashaper.workflow.workflow INFO executing verb unroll
20:15:01,838 datashaper.workflow.workflow INFO executing verb rename
20:15:01,842 datashaper.workflow.workflow INFO executing verb genid
20:15:01,848 datashaper.workflow.workflow INFO executing verb unzip
20:15:01,853 datashaper.workflow.workflow INFO executing verb copy
20:15:01,857 datashaper.workflow.workflow INFO executing verb filter
20:15:01,888 graphrag.index.run DEBUG first row of create_base_text_units => {"id":"624d2f0eb938fbf37a6e0b818f91e50a","chunk":"\u4e8c\u3001\u671b\u9762\u8272\n\u671b\u9762\u8272\uff0c\u662f\u533b\u751f\u89c2\u5bdf\u60a3\u8005\u9762\u90e8\u989c\u8272\u4e0e\u5149\u6cfd\u3002\u989c\u8272\u5c31\u662f\u8272\u8c03\u53d8\u5316,\u5149\u6cfd\u5219\u662f\u660e\u5ea6\u53d8\u5316\u3002\u53e4 \u4eba\u628a\u989c\u8272\u5206\u4e3a\u4e94\u79cd\uff0c\u5373\u9752\u3001\u8d64\u3001\u9ec4\u3001\u767d\u3001\u9ed1,\u79f0\u4e3a\u4e94\u8272\u8bca\u3002\u4e94\u8272\u7684\u53d8\u5316\uff0c\u4ee5\u9762\u90e8\u8868\u73b0\u6700\u4e3a\u660e\u663e\u3002 \u56e0\u6b64\uff0c\u672c\u4e66\u4ee5\u671b\u9762\u8272\u6765\u9610\u8ff0\u4e94\u8272\u8bca\u7684\u5185\u5bb9\u3002\n\u636e\u9634\u9633\u4e94\u884c\u548c\u810f\u8c61\u5b66\u8bf4\u7684\u7406\u8bba,\u4e94\u810f\u5e94\u4e94\u8272\u662f:\u9752\u5e94\u809d,\u8d64\u5e94\u5fc3,\u9ec4\u5e94\u813e\uff0c\u767d\u5e94\u80ba,\u9ed1\u5e94\u80be\u3002\n\uff08-\uff09\u9762\u90e8\u4e0e\u810f\u8151\u76f8\u5173\u90e8\u4f4d\n\u9762\u90e8\u7684\u5404\u90e8\u4f4d\u5206\u5c5e\u810f\u8151,\u662f\u9762\u90e8\u671b\u8bca\u7684\u57fa\u7840\u3002\u8272\u4e0e\u90e8\u4f4d\u7ed3\u5408\u8d77\u6765\uff0c\u66f4\u80fd\u8fdb\u4e00\u6b65\u4e86\u89e3\u75c5\u60c5\u3002\n\u9762\u90e8\u5206\u810f\u8151\u90e8\u4f4d:\u6839\u636e\u300a\u7075\u67a2\u2022\u4e94\u8272\u300b\u7684\u5206\u6cd5\uff0c\u628a\u6574\u4e2a\u9762\u90e8\u7684\u540d\u79f0\u5206\u4e3a\uff1a\u9f3b\u2014\u2014\u660e\u5802\uff0c\u7709 \u95f4\u4e00\u9619\uff0c\u989d\u2014\u2014\u5ead\uff08\u989c\uff09\uff0c\u988a\u4fa7\u2014\u2014\u85e9\uff0c\u8033\u95e8\u2014\u2014\u853d\n\u6309\u7167\u4e0a\u8ff0\u540d\u79f0\u548c\u4e94\u810f\u76f8\u5173\u7684\u4f4d\u7f6e\u662f\uff1a\u5ead\u2014\u2014\u9996\u9762\uff0c\u9619\u4e0a\u2014\u2014\u54bd\u5589\uff0c\u9619\u4e2d\uff08\u5370\u5802\uff09\u2014\u2014\u80ba\uff0c \u9619\u4e0b\uff08\u4e0b\u6781\uff0c\u5c71\u6839\uff09 0,\u4e0b\u6781\u4e4b\u4e0b\uff08\u5e74\u5bff\uff09\u2014\u2014\u809d\uff0c\u809d\u90e8\u5de6\u53f3\u2014\u2014\u80c6\uff0c\u809d\u4e0b\uff08\u51c6\u5934\uff09\u4e00\u813e\uff0c \u65b9\u4e0a\uff08\u813e\u4e24\u65c1\uff09\u2014\u2014\u80c3\uff0c\u4e2d\u592e\uff08\u989d\u4e0b\uff09\u2014\u2014\u5927\u80a0\uff0c\u631f\u5927\u80a0\u2014\u2014\u80be\uff0c\u660e\u5802\uff08\u9f3b\u7aef\uff09\u4ee5\u4e0a\u2014\u2014\u5c0f\u80a0\uff0c\u660e \u5802\u4ee5\u4e0b\u2014\u2014\u8180\u80f1\u5b50\u5904\uff08\u56fe2-2\uff09\u3002\n\u53e6\u5916,\u300a\u7d20\u95ee\u2022\u523a\u70ed\u7bc7\u300b\u628a\u4e94\u810f\u4e0e\u9762\u90e8\u76f8\u5173\u90e8\u4f4d\uff0c\u5212\u5206\u4e3a\uff1a\n\u5de6\u988a\u2014\u2014\u809d,\u53f3\u988a\u2014\u2014\u80ba\uff0c\u989d\u2014\u2014\u5fc3,\u987b\u2014\u2014\u80be,\u9f3b\u2014\u2014\u813e\u3002\n\u4ee5\u4e0a\u4e24\u79cd\u65b9\u6cd5\uff0c\u539f\u5219\u4e0a\u4ee5\u524d\u4e00\u79cd\u4e3a\u4e3b\u8981\u4f9d\u636e,\u540e\u4e00\u79cd\u53ef\u4f5c\u4e34\u5e8a\u53c2\u8003\u3002\n\uff08\u56db\uff09\u5e38","chunk_id":"624d2f0eb938fbf37a6e0b818f91e50a","document_ids":["e0bd1fc8d7cf72e91cf530c38e315d74"],"n_tokens":600}
20:15:01,888 graphrag.index.emit.json_table_emitter INFO emitting JSON table create_base_text_units.json
20:15:02,82 graphrag.index.run INFO Running workflow: create_base_extracted_entities...
20:15:02,83 graphrag.index.run INFO dependencies for create_base_extracted_entities: ['create_base_text_units']
20:15:02,83 graphrag.index.run ERROR error running workflow create_base_extracted_entities
Traceback (most recent call last):
  File "D:\WorkSpace\GZUCM\graphrag\graphrag\index\run.py", line 320, in run_pipeline
    await inject_workflow_data_dependencies(workflow)
  File "D:\WorkSpace\GZUCM\graphrag\graphrag\index\run.py", line 256, in inject_workflow_data_dependencies
    table = await load_table_from_storage(f"{id}.parquet")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\WorkSpace\GZUCM\graphrag\graphrag\index\run.py", line 242, in load_table_from_storage
    raise ValueError(msg)
ValueError: Could not find create_base_text_units.parquet in storage!
20:15:02,84 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None

logs.json

{"type": "error", "data": "Error running pipeline!", "stack": "Traceback (most recent call last):\n  File \"D:\\WorkSpace\\GZUCM\\graphrag\\graphrag\\index\\run.py\", line 320, in run_pipeline\n    await inject_workflow_data_dependencies(workflow)\n  File \"D:\\WorkSpace\\GZUCM\\graphrag\\graphrag\\index\\run.py\", line 256, in inject_workflow_data_dependencies\n    table = await load_table_from_storage(f\"{id}.parquet\")\n            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"D:\\WorkSpace\\GZUCM\\graphrag\\graphrag\\index\\run.py\", line 242, in load_table_from_storage\n    raise ValueError(msg)\nValueError: Could not find create_base_text_units.parquet in storage!\n", "source": "Could not find create_base_text_units.parquet in storage!", "details": null}

Additional Information

night666e commented 1 month ago

Following this thread. Has it been solved yet?

9prodhi commented 1 month ago

Update to Improve Indexing Process

The following changes in the run.py file will help complete the indexing process:

  1. Modified load_table_from_storage function to handle JSON files:

    async def load_table_from_storage(name: str) -> pd.DataFrame:
        if not await storage.has(name):
            msg = f"Could not find {name} in storage!"
            raise ValueError(msg)
        try:
            log.info("read table from storage: %s", name)
            # Read JSON data instead of Parquet
            content = await storage.get(name, encoding='utf-8')
            json_data = [json.loads(line) for line in content.splitlines() if line.strip()]
            return pd.DataFrame(json_data)
        except Exception:
            log.exception("error loading table from storage: %s", name)
            raise
  2. Updated inject_workflow_data_dependencies function to use JSON files:

    
    async def inject_workflow_data_dependencies(workflow: Workflow) -> None:
        workflow.add_table(DEFAULT_INPUT_NAME, dataset)
        deps = workflow_dependencies[workflow.name]
        log.info("dependencies for %s: %s", workflow.name, deps)
        for id in deps:
            workflow_id = f"workflow:{id}"
            # Load JSON file instead of Parquet
            table = await load_table_from_storage(f"{id}.json")
            workflow.add_table(workflow_id, table)
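
One caveat: the two snippets above assume the JSON emitter writes newline-delimited records, need import json at the top of run.py if it is not already there, and hard-code .json in the same way the original hard-coded .parquet, so the default parquet emit would then break. A more general variant, an untested sketch rather than a tested patch, probes storage for either extension. Like the snippets above it relies on the module-level storage and log objects in run.py, assumes storage.get(..., as_bytes=True) returns the raw parquet bytes, and would be called with the bare workflow id (load_table_from_storage(id)) from inject_workflow_data_dependencies:

    import json
    from io import BytesIO

    import pandas as pd

    async def load_table_from_storage(name: str) -> pd.DataFrame:
        """Load a workflow artifact regardless of which format was emitted (untested sketch)."""
        # Probe for the artifact under each emitter extension instead of assuming parquet.
        for ext in (".parquet", ".json"):
            candidate = f"{name}{ext}"
            if await storage.has(candidate):
                log.info("read table from storage: %s", candidate)
                if ext == ".parquet":
                    # Parquet artifacts are read back from raw bytes.
                    return pd.read_parquet(BytesIO(await storage.get(candidate, as_bytes=True)))
                # Otherwise assume newline-delimited JSON, as written by the JSON emitter.
                content = await storage.get(candidate, encoding="utf-8")
                return pd.DataFrame(
                    [json.loads(line) for line in content.splitlines() if line.strip()]
                )
        msg = f"Could not find {name} in storage!"
        raise ValueError(msg)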

The run_global_search and run_local_search functions still need to be updated to remove the .parquet hardcoding for the query functionality to work.

I will post the updates to these functions here once they are done.

fantom845 commented 1 month ago

Update to run.py to make the querying process work with --emit json

Replace the following functions in the run.py file in the query folder with the implementations below. Tested locally with the documentation example, and it works fine.

  1. run_local_search

    
    def run_local_search(
        data_dir: str | None,
        root_dir: str | None,
        community_level: int,
        response_type: str,
        query: str,
    ):
        """Run a local search with the given query."""
        data_dir, root_dir, config = _configure_paths_and_settings(data_dir, root_dir)
        data_path = Path(data_dir)

        def read_json_file(file_path):
            with open(file_path, 'r') as f:
                return pd.DataFrame([json.loads(line) for line in f if line.strip()])

        final_nodes = read_json_file(data_path / "create_final_nodes.json")
        final_community_reports = read_json_file(data_path / "create_final_community_reports.json")
        final_text_units = read_json_file(data_path / "create_final_text_units.json")
        final_relationships = read_json_file(data_path / "create_final_relationships.json")
        final_entities = read_json_file(data_path / "create_final_entities.json")
        final_covariates_path = data_path / "create_final_covariates.json"
        final_covariates = read_json_file(final_covariates_path) if final_covariates_path.exists() else None

        vector_store_args = config.embeddings.vector_store if config.embeddings.vector_store else {}
        vector_store_type = vector_store_args.get("type", VectorStoreType.LanceDB)

        description_embedding_store = __get_embedding_description_store(
            vector_store_type=vector_store_type,
            config_args=vector_store_args,
        )
        entities = read_indexer_entities(final_nodes, final_entities, community_level)
        store_entity_semantic_embeddings(
            entities=entities, vectorstore=description_embedding_store
        )
        covariates = read_indexer_covariates(final_covariates) if final_covariates is not None else []

        search_engine = get_local_search_engine(
            config,
            reports=read_indexer_reports(
                final_community_reports, final_nodes, community_level
            ),
            text_units=read_indexer_text_units(final_text_units),
            entities=entities,
            relationships=read_indexer_relationships(final_relationships),
            covariates={"claims": covariates},
            description_embedding_store=description_embedding_store,
            response_type=response_type,
        )

        result = search_engine.search(query=query)
        reporter.success(f"Local Search Response: {result.response}")
        return result.response

  2. run_global_search

    def run_global_search(
        data_dir: str | None,
        root_dir: str | None,
        community_level: int,
        response_type: str,
        query: str,
    ):
        """Run a global search with the given query."""
        data_dir, root_dir, config = _configure_paths_and_settings(data_dir, root_dir)
        data_path = Path(data_dir)

        def read_json_file(file_path):
            with open(file_path, 'r') as f:
                return pd.DataFrame([json.loads(line) for line in f if line.strip()])

        final_nodes: pd.DataFrame = read_json_file(data_path / "create_final_nodes.json")
        final_entities: pd.DataFrame = read_json_file(data_path / "create_final_entities.json")
        final_community_reports: pd.DataFrame = read_json_file(data_path / "create_final_community_reports.json")

        reports = read_indexer_reports(
            final_community_reports, final_nodes, community_level
        )
        entities = read_indexer_entities(final_nodes, final_entities, community_level)
        search_engine = get_global_search_engine(
            config,
            reports=reports,
            entities=entities,
            response_type=response_type,
        )

        result = search_engine.search(query=query)

        reporter.success(f"Global Search Response: {result.response}")
        return result.response
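
For a quick end-to-end check of the patched functions, something like the following can be run from the project root. This is only a hedged sketch: the import path (graphrag.query.cli in my checkout), the artifacts directory, and the queries are assumptions to adapt to your setup.

    # Hypothetical smoke test for the patched query functions; adjust import and paths.
    from graphrag.query.cli import run_global_search, run_local_search

    ARTIFACTS = "./output/20240808-201501/artifacts"  # example artifacts dir from the logs above

    print(run_global_search(
        data_dir=ARTIFACTS,
        root_dir=".",
        community_level=2,
        response_type="Multiple Paragraphs",
        query="What are the top themes in this dataset?",
    ))

    print(run_local_search(
        data_dir=ARTIFACTS,
        root_dir=".",
        community_level=2,
        response_type="Multiple Paragraphs",
        query="Who are the key entities and how are they related?",
    ))
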
Lycnkd commented 1 day ago

The original source code hardcodes the emit type in some functions, so the --emit option has no effect there. Are the two solutions mentioned above part of an official update? The code I pulled in September doesn't seem to include these changes.