microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License
17.1k stars 1.61k forks

[Bug]: Pipeline Fails with --emit json Option: Unable to Find create_base_text_units.parquet Despite create_base_text_units.json Existing #871

Open 6ixGODD opened 1 month ago

6ixGODD commented 1 month ago

Do you need to file an issue?

Describe the bug

When indexing with the command poetry run poe index --verbose --emit json (i.e. setting the emit format to JSON), the pipeline fails right after the create_base_text_units workflow completes. The error message says it cannot find create_base_text_units.parquet, even though create_base_text_units.json exists in the output directory.

Steps to reproduce

Expected Behavior

The pipeline should recognize and use the create_base_text_units.json file in the output directory when --emit json is specified.
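
For what it's worth, the traceback in the logs below suggests this comes down to a simple filename mismatch: the JSON emitter writes create_base_text_units.json, but the dependency-injection step always asks storage for the .parquet name. A minimal illustration (hypothetical variable names, not the actual graphrag source):

    # Hypothetical sketch of the suspected mismatch; not graphrag code.
    emit_type = "json"
    workflow_id = "create_base_text_units"

    emitted_name = f"{workflow_id}.{emit_type}"  # what the emitter writes
    requested_name = f"{workflow_id}.parquet"    # what the next workflow asks storage for
    assert emitted_name != requested_name        # hence "Could not find ... in storage!"

If the --emit option accepts a comma-separated list, emitting both formats (for example --emit parquet,json) should sidestep the failure, since the parquet files the loader expects would still be produced.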

GraphRAG Config Used

# Paste your config here

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini  # TODO: Change
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  tokens_per_minute: 150000 # set a leaky bucket throttle TODO: Change
  requests_per_minute: 10000 # set a leaky bucket throttle TODO: Change
  max_retries: 10 # TODO: Change
  max_retry_wait: 10.0 # TODO: Change
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 600 # TODO: Change
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

Console: (screenshot attached)

Logs: indexing-engine.log

20:15:01,324 asyncio DEBUG Using proactor: IocpProactor
20:15:01,336 graphrag.config.read_dotenv INFO Loading pipeline .env file
20:15:01,340 graphrag.index.cli INFO using default configuration: {
    "llm": {
        "api_key": "REDACTED, length 56",
        "type": "openai_chat",
        "model": "gpt-4o-mini",
        "max_tokens": 4000,
        "temperature": 0.0,
        "top_p": 1.0,
        "n": 1,
        "request_timeout": 180.0,
        "api_base": null,
        "api_version": null,
        "proxy": null,
        "cognitive_services_endpoint": null,
        "deployment_name": null,
        "model_supports_json": true,
        "tokens_per_minute": 150000,
        "requests_per_minute": 10000,
        "max_retries": 10,
        "max_retry_wait": 10.0,
        "sleep_on_rate_limit_recommendation": true,
        "concurrent_requests": 25
    },
    "parallelization": {
        "stagger": 0.3,
        "num_threads": 50
    },
    "async_mode": "threaded",
    "root_dir": ".",
    "reporting": {
        "type": "file",
        "base_dir": "output/${timestamp}/reports",
        "storage_account_blob_url": null
    },
    "storage": {
        "type": "file",
        "base_dir": "output/${timestamp}/artifacts",
        "storage_account_blob_url": null
    },
    "cache": {
        "type": "file",
        "base_dir": "cache",
        "storage_account_blob_url": null
    },
    "input": {
        "type": "file",
        "file_type": "text",
        "base_dir": "input",
        "storage_account_blob_url": null,
        "encoding": "utf-8",
        "file_pattern": ".*\\.txt$",
        "file_filter": null,
        "source_column": null,
        "timestamp_column": null,
        "timestamp_format": null,
        "text_column": "text",
        "title_column": null,
        "document_attribute_columns": []
    },
    "embed_graph": {
        "enabled": false,
        "num_walks": 10,
        "walk_length": 40,
        "window_size": 2,
        "iterations": 3,
        "random_seed": 597832,
        "strategy": null
    },
    "embeddings": {
        "llm": {
            "api_key": "REDACTED, length 56",
            "type": "openai_embedding",
            "model": "text-embedding-3-small",
            "max_tokens": 4000,
            "temperature": 0,
            "top_p": 1,
            "n": 1,
            "request_timeout": 180.0,
            "api_base": null,
            "api_version": null,
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": null,
            "tokens_per_minute": 0,
            "requests_per_minute": 0,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "batch_size": 16,
        "batch_max_tokens": 8191,
        "target": "required",
        "skip": [],
        "vector_store": null,
        "strategy": null
    },
    "chunks": {
        "size": 600,
        "overlap": 100,
        "group_by_columns": [
            "id"
        ],
        "strategy": null,
        "encoding_model": null
    },
    "snapshots": {
        "graphml": false,
        "raw_entities": false,
        "top_level_nodes": false
    },
    "entity_extraction": {
        "llm": {
            "api_key": "REDACTED, length 56",
            "type": "openai_chat",
            "model": "gpt-4o-mini",
            "max_tokens": 4000,
            "temperature": 0.0,
            "top_p": 1.0,
            "n": 1,
            "request_timeout": 180.0,
            "api_base": null,
            "api_version": null,
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": true,
            "tokens_per_minute": 150000,
            "requests_per_minute": 10000,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "prompt": "prompts/entity_extraction.txt",
        "entity_types": [
            "organization",
            "person",
            "geo",
            "event"
        ],
        "max_gleanings": 1,
        "strategy": null,
        "encoding_model": null
    },
    "summarize_descriptions": {
        "llm": {
            "api_key": "REDACTED, length 56",
            "type": "openai_chat",
            "model": "gpt-4o-mini",
            "max_tokens": 4000,
            "temperature": 0.0,
            "top_p": 1.0,
            "n": 1,
            "request_timeout": 180.0,
            "api_base": null,
            "api_version": null,
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": true,
            "tokens_per_minute": 150000,
            "requests_per_minute": 10000,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "prompt": "prompts/summarize_descriptions.txt",
        "max_length": 500,
        "strategy": null
    },
    "community_reports": {
        "llm": {
            "api_key": "REDACTED, length 56",
            "type": "openai_chat",
            "model": "gpt-4o-mini",
            "max_tokens": 4000,
            "temperature": 0.0,
            "top_p": 1.0,
            "n": 1,
            "request_timeout": 180.0,
            "api_base": null,
            "api_version": null,
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": true,
            "tokens_per_minute": 150000,
            "requests_per_minute": 10000,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "prompt": "prompts/community_report.txt",
        "max_length": 2000,
        "max_input_length": 8000,
        "strategy": null
    },
    "claim_extraction": {
        "llm": {
            "api_key": "REDACTED, length 56",
            "type": "openai_chat",
            "model": "gpt-4o-mini",
            "max_tokens": 4000,
            "temperature": 0.0,
            "top_p": 1.0,
            "n": 1,
            "request_timeout": 180.0,
            "api_base": null,
            "api_version": null,
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": true,
            "tokens_per_minute": 150000,
            "requests_per_minute": 10000,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "enabled": false,
        "prompt": "prompts/claim_extraction.txt",
        "description": "Any claims or facts that could be relevant to information discovery.",
        "max_gleanings": 1,
        "strategy": null,
        "encoding_model": null
    },
    "cluster_graph": {
        "max_cluster_size": 10,
        "strategy": null
    },
    "umap": {
        "enabled": false
    },
    "local_search": {
        "text_unit_prop": 0.5,
        "community_prop": 0.1,
        "conversation_history_max_turns": 5,
        "top_k_entities": 10,
        "top_k_relationships": 10,
        "temperature": 0.0,
        "top_p": 1.0,
        "n": 1,
        "max_tokens": 12000,
        "llm_max_tokens": 2000
    },
    "global_search": {
        "temperature": 0.0,
        "top_p": 1.0,
        "n": 1,
        "max_tokens": 12000,
        "data_max_tokens": 12000,
        "map_max_tokens": 1000,
        "reduce_max_tokens": 2000,
        "concurrency": 32
    },
    "encoding_model": "cl100k_base",
    "skip_workflows": []
}
20:15:01,364 graphrag.index.create_pipeline_config INFO Using LLM Config {
    "api_key": "*****",
    "type": "openai_chat",
    "model": "gpt-4o-mini",
    "max_tokens": 4000,
    "temperature": 0.0,
    "top_p": 1.0,
    "n": 1,
    "request_timeout": 180.0,
    "api_base": null,
    "api_version": null,
    "organization": null,
    "proxy": null,
    "cognitive_services_endpoint": null,
    "deployment_name": null,
    "model_supports_json": true,
    "tokens_per_minute": 150000,
    "requests_per_minute": 10000,
    "max_retries": 10,
    "max_retry_wait": 10.0,
    "sleep_on_rate_limit_recommendation": true,
    "concurrent_requests": 25
}
20:15:01,364 graphrag.index.create_pipeline_config INFO Using Embeddings Config {
    "api_key": "*****",
    "type": "openai_embedding",
    "model": "text-embedding-3-small",
    "max_tokens": 4000,
    "temperature": 0,
    "top_p": 1,
    "n": 1,
    "request_timeout": 180.0,
    "api_base": null,
    "api_version": null,
    "organization": null,
    "proxy": null,
    "cognitive_services_endpoint": null,
    "deployment_name": null,
    "model_supports_json": null,
    "tokens_per_minute": 0,
    "requests_per_minute": 0,
    "max_retries": 10,
    "max_retry_wait": 10.0,
    "sleep_on_rate_limit_recommendation": true,
    "concurrent_requests": 25
}
20:15:01,366 graphrag.index.create_pipeline_config INFO skipping workflows 
20:15:01,457 graphrag.index.run INFO Running pipeline
20:15:01,457 graphrag.index.storage.file_pipeline_storage INFO Creating file storage at output\20240808-201501\artifacts
20:15:01,458 graphrag.index.input.load_input INFO loading input from root_dir=input
20:15:01,459 graphrag.index.input.load_input INFO using file storage for input
20:15:01,460 graphrag.index.storage.file_pipeline_storage INFO search input for files matching .*\.txt$
20:15:01,461 graphrag.index.input.text INFO found text files from input, found [('��ҽ���ѧ���ﲿ����������.txt', {})]
20:15:01,464 graphrag.index.input.text INFO Found 1 files, loading 1
20:15:01,466 graphrag.index.workflows.load INFO Workflow Run Order: ['create_base_text_units', 'create_base_extracted_entities', 'create_summarized_entities', 'create_base_entity_graph', 'create_final_entities', 'create_final_nodes', 'create_final_communities', 'join_text_units_to_entity_ids', 'create_final_relationships', 'join_text_units_to_relationship_ids', 'create_final_community_reports', 'create_final_text_units', 'create_base_documents', 'create_final_documents']
20:15:01,466 graphrag.index.run INFO Final # of rows loaded: 1
20:15:01,618 graphrag.index.run INFO Running workflow: create_base_text_units...
20:15:01,618 graphrag.index.run INFO dependencies for create_base_text_units: []
20:15:01,622 datashaper.workflow.workflow INFO executing verb orderby
20:15:01,627 datashaper.workflow.workflow INFO executing verb zip
20:15:01,631 datashaper.workflow.workflow INFO executing verb aggregate_override
20:15:01,642 datashaper.workflow.workflow INFO executing verb chunk
20:15:01,824 datashaper.workflow.workflow INFO executing verb select
20:15:01,830 datashaper.workflow.workflow INFO executing verb unroll
20:15:01,838 datashaper.workflow.workflow INFO executing verb rename
20:15:01,842 datashaper.workflow.workflow INFO executing verb genid
20:15:01,848 datashaper.workflow.workflow INFO executing verb unzip
20:15:01,853 datashaper.workflow.workflow INFO executing verb copy
20:15:01,857 datashaper.workflow.workflow INFO executing verb filter
20:15:01,888 graphrag.index.run DEBUG first row of create_base_text_units => {"id":"624d2f0eb938fbf37a6e0b818f91e50a","chunk":"\u4e8c\u3001\u671b\u9762\u8272\n\u671b\u9762\u8272\uff0c\u662f\u533b\u751f\u89c2\u5bdf\u60a3\u8005\u9762\u90e8\u989c\u8272\u4e0e\u5149\u6cfd\u3002\u989c\u8272\u5c31\u662f\u8272\u8c03\u53d8\u5316,\u5149\u6cfd\u5219\u662f\u660e\u5ea6\u53d8\u5316\u3002\u53e4 \u4eba\u628a\u989c\u8272\u5206\u4e3a\u4e94\u79cd\uff0c\u5373\u9752\u3001\u8d64\u3001\u9ec4\u3001\u767d\u3001\u9ed1,\u79f0\u4e3a\u4e94\u8272\u8bca\u3002\u4e94\u8272\u7684\u53d8\u5316\uff0c\u4ee5\u9762\u90e8\u8868\u73b0\u6700\u4e3a\u660e\u663e\u3002 \u56e0\u6b64\uff0c\u672c\u4e66\u4ee5\u671b\u9762\u8272\u6765\u9610\u8ff0\u4e94\u8272\u8bca\u7684\u5185\u5bb9\u3002\n\u636e\u9634\u9633\u4e94\u884c\u548c\u810f\u8c61\u5b66\u8bf4\u7684\u7406\u8bba,\u4e94\u810f\u5e94\u4e94\u8272\u662f:\u9752\u5e94\u809d,\u8d64\u5e94\u5fc3,\u9ec4\u5e94\u813e\uff0c\u767d\u5e94\u80ba,\u9ed1\u5e94\u80be\u3002\n\uff08-\uff09\u9762\u90e8\u4e0e\u810f\u8151\u76f8\u5173\u90e8\u4f4d\n\u9762\u90e8\u7684\u5404\u90e8\u4f4d\u5206\u5c5e\u810f\u8151,\u662f\u9762\u90e8\u671b\u8bca\u7684\u57fa\u7840\u3002\u8272\u4e0e\u90e8\u4f4d\u7ed3\u5408\u8d77\u6765\uff0c\u66f4\u80fd\u8fdb\u4e00\u6b65\u4e86\u89e3\u75c5\u60c5\u3002\n\u9762\u90e8\u5206\u810f\u8151\u90e8\u4f4d:\u6839\u636e\u300a\u7075\u67a2\u2022\u4e94\u8272\u300b\u7684\u5206\u6cd5\uff0c\u628a\u6574\u4e2a\u9762\u90e8\u7684\u540d\u79f0\u5206\u4e3a\uff1a\u9f3b\u2014\u2014\u660e\u5802\uff0c\u7709 \u95f4\u4e00\u9619\uff0c\u989d\u2014\u2014\u5ead\uff08\u989c\uff09\uff0c\u988a\u4fa7\u2014\u2014\u85e9\uff0c\u8033\u95e8\u2014\u2014\u853d\n\u6309\u7167\u4e0a\u8ff0\u540d\u79f0\u548c\u4e94\u810f\u76f8\u5173\u7684\u4f4d\u7f6e\u662f\uff1a\u5ead\u2014\u2014\u9996\u9762\uff0c\u9619\u4e0a\u2014\u2014\u54bd\u5589\uff0c\u9619\u4e2d\uff08\u5370\u5802\uff09\u2014\u2014\u80ba\uff0c \u9619\u4e0b\uff08\u4e0b\u6781\uff0c\u5c71\u6839\uff09 0,\u4e0b\u6781\u4e4b\u4e0b\uff08\u5e74\u5bff\uff09\u2014\u2014\u809d\uff0c\u809d\u90e8\u5de6\u53f3\u2014\u2014\u80c6\uff0c\u809d\u4e0b\uff08\u51c6\u5934\uff09\u4e00\u813e\uff0c \u65b9\u4e0a\uff08\u813e\u4e24\u65c1\uff09\u2014\u2014\u80c3\uff0c\u4e2d\u592e\uff08\u989d\u4e0b\uff09\u2014\u2014\u5927\u80a0\uff0c\u631f\u5927\u80a0\u2014\u2014\u80be\uff0c\u660e\u5802\uff08\u9f3b\u7aef\uff09\u4ee5\u4e0a\u2014\u2014\u5c0f\u80a0\uff0c\u660e \u5802\u4ee5\u4e0b\u2014\u2014\u8180\u80f1\u5b50\u5904\uff08\u56fe2-2\uff09\u3002\n\u53e6\u5916,\u300a\u7d20\u95ee\u2022\u523a\u70ed\u7bc7\u300b\u628a\u4e94\u810f\u4e0e\u9762\u90e8\u76f8\u5173\u90e8\u4f4d\uff0c\u5212\u5206\u4e3a\uff1a\n\u5de6\u988a\u2014\u2014\u809d,\u53f3\u988a\u2014\u2014\u80ba\uff0c\u989d\u2014\u2014\u5fc3,\u987b\u2014\u2014\u80be,\u9f3b\u2014\u2014\u813e\u3002\n\u4ee5\u4e0a\u4e24\u79cd\u65b9\u6cd5\uff0c\u539f\u5219\u4e0a\u4ee5\u524d\u4e00\u79cd\u4e3a\u4e3b\u8981\u4f9d\u636e,\u540e\u4e00\u79cd\u53ef\u4f5c\u4e34\u5e8a\u53c2\u8003\u3002\n\uff08\u56db\uff09\u5e38","chunk_id":"624d2f0eb938fbf37a6e0b818f91e50a","document_ids":["e0bd1fc8d7cf72e91cf530c38e315d74"],"n_tokens":600}
20:15:01,888 graphrag.index.emit.json_table_emitter INFO emitting JSON table create_base_text_units.json
20:15:02,82 graphrag.index.run INFO Running workflow: create_base_extracted_entities...
20:15:02,83 graphrag.index.run INFO dependencies for create_base_extracted_entities: ['create_base_text_units']
20:15:02,83 graphrag.index.run ERROR error running workflow create_base_extracted_entities
Traceback (most recent call last):
  File "D:\WorkSpace\GZUCM\graphrag\graphrag\index\run.py", line 320, in run_pipeline
    await inject_workflow_data_dependencies(workflow)
  File "D:\WorkSpace\GZUCM\graphrag\graphrag\index\run.py", line 256, in inject_workflow_data_dependencies
    table = await load_table_from_storage(f"{id}.parquet")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\WorkSpace\GZUCM\graphrag\graphrag\index\run.py", line 242, in load_table_from_storage
    raise ValueError(msg)
ValueError: Could not find create_base_text_units.parquet in storage!
20:15:02,84 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None

logs.json

{"type": "error", "data": "Error running pipeline!", "stack": "Traceback (most recent call last):\n  File \"D:\\WorkSpace\\GZUCM\\graphrag\\graphrag\\index\\run.py\", line 320, in run_pipeline\n    await inject_workflow_data_dependencies(workflow)\n  File \"D:\\WorkSpace\\GZUCM\\graphrag\\graphrag\\index\\run.py\", line 256, in inject_workflow_data_dependencies\n    table = await load_table_from_storage(f\"{id}.parquet\")\n            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"D:\\WorkSpace\\GZUCM\\graphrag\\graphrag\\index\\run.py\", line 242, in load_table_from_storage\n    raise ValueError(msg)\nValueError: Could not find create_base_text_units.parquet in storage!\n", "source": "Could not find create_base_text_units.parquet in storage!", "details": null}

Additional Information

night666e commented 1 month ago

Following this thread. Has it been solved yet?

9prodhi commented 1 month ago

Update to Improve Indexing Process

The following changes in the run.py file will help complete the indexing process:

  1. Modified load_table_from_storage function to handle JSON files:

    async def load_table_from_storage(name: str) -> pd.DataFrame:
        if not await storage.has(name):
            msg = f"Could not find {name} in storage!"
            raise ValueError(msg)
        try:
            log.info("read table from storage: %s", name)
            # Read JSON data instead of Parquet
            content = await storage.get(name, encoding='utf-8')
            json_data = [json.loads(line) for line in content.splitlines() if line.strip()]
            return pd.DataFrame(json_data)
        except Exception:
            log.exception("error loading table from storage: %s", name)
            raise
  2. Updated inject_workflow_data_dependencies function to use JSON files:

    
    async def inject_workflow_data_dependencies(workflow: Workflow) -> None:
        workflow.add_table(DEFAULT_INPUT_NAME, dataset)
        deps = workflow_dependencies[workflow.name]
        log.info("dependencies for %s: %s", workflow.name, deps)
        for id in deps:
            workflow_id = f"workflow:{id}"
            # Load JSON file instead of Parquet
            table = await load_table_from_storage(f"{id}.json")
            workflow.add_table(workflow_id, table)
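
One caveat: the two snippets above assume the JSON emitter writes newline-delimited records, need import json at the top of run.py if it is not already there, and hard-code .json in the same way the original hard-coded .parquet, so the default parquet emit would then break. A more general variant, an untested sketch rather than a tested patch, probes storage for either extension. Like the snippets above it relies on the module-level storage and log objects in run.py, assumes storage.get(..., as_bytes=True) returns the raw parquet bytes, and would be called with the bare workflow id (load_table_from_storage(id)) from inject_workflow_data_dependencies:

    import json
    from io import BytesIO

    import pandas as pd

    async def load_table_from_storage(name: str) -> pd.DataFrame:
        """Load a workflow artifact regardless of which format was emitted (untested sketch)."""
        # Probe for the artifact under each emitter extension instead of assuming parquet.
        for ext in (".parquet", ".json"):
            candidate = f"{name}{ext}"
            if await storage.has(candidate):
                log.info("read table from storage: %s", candidate)
                if ext == ".parquet":
                    # Parquet artifacts are read back from raw bytes.
                    return pd.read_parquet(BytesIO(await storage.get(candidate, as_bytes=True)))
                # Otherwise assume newline-delimited JSON, as written by the JSON emitter.
                content = await storage.get(candidate, encoding="utf-8")
                return pd.DataFrame(
                    [json.loads(line) for line in content.splitlines() if line.strip()]
                )
        msg = f"Could not find {name} in storage!"
        raise ValueError(msg)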

The run_global_search and run_local_search functions still need to be updated to remove the .parquet hardcoding for the query functionality to work.

I will post the updates to these functions here once they are done.

fantom845 commented 1 month ago

Update to run.py to make the querying process work with --emit json

Replace the following functions in the run.py file in the query folder with the implementations below. Tested locally with the documentation example, and it works fine.

  1. run_local_search

    
    def run_local_search(
        data_dir: str | None,
        root_dir: str | None,
        community_level: int,
        response_type: str,
        query: str,
    ):
        """Run a local search with the given query."""
        data_dir, root_dir, config = _configure_paths_and_settings(data_dir, root_dir)
        data_path = Path(data_dir)

        def read_json_file(file_path):
            with open(file_path, 'r') as f:
                return pd.DataFrame([json.loads(line) for line in f if line.strip()])

        final_nodes = read_json_file(data_path / "create_final_nodes.json")
        final_community_reports = read_json_file(data_path / "create_final_community_reports.json")
        final_text_units = read_json_file(data_path / "create_final_text_units.json")
        final_relationships = read_json_file(data_path / "create_final_relationships.json")
        final_entities = read_json_file(data_path / "create_final_entities.json")
        final_covariates_path = data_path / "create_final_covariates.json"
        final_covariates = read_json_file(final_covariates_path) if final_covariates_path.exists() else None

        vector_store_args = config.embeddings.vector_store if config.embeddings.vector_store else {}
        vector_store_type = vector_store_args.get("type", VectorStoreType.LanceDB)

        description_embedding_store = __get_embedding_description_store(
            vector_store_type=vector_store_type,
            config_args=vector_store_args,
        )
        entities = read_indexer_entities(final_nodes, final_entities, community_level)
        store_entity_semantic_embeddings(
            entities=entities, vectorstore=description_embedding_store
        )
        covariates = read_indexer_covariates(final_covariates) if final_covariates is not None else []

        search_engine = get_local_search_engine(
            config,
            reports=read_indexer_reports(
                final_community_reports, final_nodes, community_level
            ),
            text_units=read_indexer_text_units(final_text_units),
            entities=entities,
            relationships=read_indexer_relationships(final_relationships),
            covariates={"claims": covariates},
            description_embedding_store=description_embedding_store,
            response_type=response_type,
        )

        result = search_engine.search(query=query)
        reporter.success(f"Local Search Response: {result.response}")
        return result.response

  2. run_global_search

    def run_global_search(
        data_dir: str | None,
        root_dir: str | None,
        community_level: int,
        response_type: str,
        query: str,
    ):
        """Run a global search with the given query."""
        data_dir, root_dir, config = _configure_paths_and_settings(data_dir, root_dir)
        data_path = Path(data_dir)

        def read_json_file(file_path):
            with open(file_path, 'r') as f:
                return pd.DataFrame([json.loads(line) for line in f if line.strip()])

        final_nodes: pd.DataFrame = read_json_file(data_path / "create_final_nodes.json")
        final_entities: pd.DataFrame = read_json_file(data_path / "create_final_entities.json")
        final_community_reports: pd.DataFrame = read_json_file(data_path / "create_final_community_reports.json")

        reports = read_indexer_reports(
            final_community_reports, final_nodes, community_level
        )
        entities = read_indexer_entities(final_nodes, final_entities, community_level)
        search_engine = get_global_search_engine(
            config,
            reports=reports,
            entities=entities,
            response_type=response_type,
        )

        result = search_engine.search(query=query)

        reporter.success(f"Global Search Response: {result.response}")
        return result.response
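
For a quick end-to-end check of the patched functions, something like the following can be run from the project root. This is only a hedged sketch: the import path (graphrag.query.cli in my checkout), the artifacts directory, and the queries are assumptions to adapt to your setup.

    # Hypothetical smoke test for the patched query functions; adjust import and paths.
    from graphrag.query.cli import run_global_search, run_local_search

    ARTIFACTS = "./output/20240808-201501/artifacts"  # example artifacts dir from the logs above

    print(run_global_search(
        data_dir=ARTIFACTS,
        root_dir=".",
        community_level=2,
        response_type="Multiple Paragraphs",
        query="What are the top themes in this dataset?",
    ))

    print(run_local_search(
        data_dir=ARTIFACTS,
        root_dir=".",
        community_level=2,
        response_type="Multiple Paragraphs",
        query="Who are the key entities and how are they related?",
    ))
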
Lycnkd commented 1 day ago

The original source code hardcodes the emit type in some functions, so the --emit option has no effect there. Are the two solutions mentioned above part of an official update? The code I pulled in September doesn't seem to include these changes.