microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Issue]: Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key #974

Open xiangjingwei123 opened 3 weeks ago

xiangjingwei123 commented 3 weeks ago

Do you need to file an issue?

Describe the issue

{"type": "error", "data": "Error executing verb \"cluster_graph\" in create_base_entity_graph: Columns must be same length as key", "stack": "Traceback (most recent call last):\n File \"/data/jupyter/myenv/lib/python3.10/site-packages/datashaper/workflow/workflow.py\", line 410, in _execute_verb\n result = node.verb.func(verb_args)\n File \"/data/jupyter/myenv/lib/python3.10/site-packages/graphrag/index/verbs/graph/clustering/cluster_graph.py\", line 102, in cluster_graph\n output_df[[level_to, to]] = pd.DataFrame(\n File \"/data/jupyter/myenv/lib/python3.10/site-packages/pandas/core/frame.py\", line 4299, in setitem\n self._setitem_array(key, value)\n File \"/data/jupyter/myenv/lib/python3.10/site-packages/pandas/core/frame.py\", line 4341, in _setitem_array\n check_key_length(self.columns, key, value)\n File \"/data/jupyter/myenv/lib/python3.10/site-packages/pandas/core/indexers/utils.py\", line 390, in check_key_length\n raise ValueError(\"Columns must be same length as key\")\nValueError: Columns must be same length as key\n", "source": "Columns must be same length as key", "details": null}{"type": "error", "data": "Error running pipeline!", "stack": "Traceback (most recent call last):\n File \"/data/jupyter/myenv/lib/python3.10/site-packages/graphrag/index/run.py\", line 325, in run_pipeline\n result = await workflow.run(context, callbacks)\n File \"/data/jupyter/myenv/lib/python3.10/site-packages/datashaper/workflow/workflow.py\", line 369, in run\n timing = await self._execute_verb(node, context, callbacks)\n File \"/data/jupyter/myenv/lib/python3.10/site-packages/datashaper/workflow/workflow.py\", line 410, in _execute_verb\n result = node.verb.func(verb_args)\n File \"/data/jupyter/myenv/lib/python3.10/site-packages/graphrag/index/verbs/graph/clustering/cluster_graph.py\", line 102, in cluster_graph\n output_df[[level_to, to]] = pd.DataFrame(\n File \"/data/jupyter/myenv/lib/python3.10/site-packages/pandas/core/frame.py\", line 4299, in setitem\n self._setitem_array(key, value)\n File \"/data/jupyter/myenv/lib/python3.10/site-packages/pandas/core/frame.py\", line 4341, in _setitem_array\n check_key_length(self.columns, key, value)\n File \"/data/jupyter/myenv/lib/python3.10/site-packages/pandas/core/indexers/utils.py\", line 390, in check_key_length\n raise ValueError(\"Columns must be same length as key\")\nValueError: Columns must be same length as key\n", "source": "Columns must be same length as key", "details": null}

Steps to reproduce

After I executed 'python -m graphrag.index --root ./ragtest', the failure happened.

GraphRAG Config Used

root@notebook-service-v1-pro-616-bc885c877-xjmrb:/data/jupyter/ragtest# cat settings.yaml 

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: gpt-4o
  model_supports_json: true # recommended if this is available for your model.
  max_tokens: 1000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

No response

Additional Information

natoverse commented 3 weeks ago

Please inspect the indexing-engine.log. Often this error is preceded by errors earlier in the pipeline, usually due to OpenAI key issues such as permissions or missing config.
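
With the file reporting configured in the settings above (base_dir: "output/${timestamp}/reports"), the log for each run is written next to the artifacts, for example:

./ragtest/output/<timestamp>/reports/indexing-engine.log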

xiangjingwei123 commented 3 weeks ago

indexing-engine.log:

17:35:14,702 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_base_text_units.parquet
17:35:15,63 graphrag.index.run INFO Running workflow: create_base_extracted_entities...
17:35:15,69 graphrag.index.run INFO dependencies for create_base_extracted_entities: ['create_base_text_units']
17:35:15,81 graphrag.index.run INFO read table from storage: create_base_text_units.parquet
17:35:15,130 datashaper.workflow.workflow INFO executing verb entity_extract
17:35:15,141 graphrag.llm.openai.create_openai_client INFO Creating OpenAI client base_url=None
17:35:15,164 graphrag.index.llm.load_llm INFO create TPM/RPM limiter for gpt-4o: TPM=0, RPM=0
17:35:15,164 graphrag.index.llm.load_llm INFO create concurrency limiter for gpt-4o: 25
17:37:22,715 graphrag.index.reporting.file_workflow_callbacks INFO Error Invoking LLM details={'input': '\n-Goal-\nGiven a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.\n \n-Steps-\n1. Identify all entities. For each identified entity, extract the following information:\n- entity_name: Name of the entity, capitalized\n- entity_type: One of the following types: [organization,person,geo,event]\n- entity_description: Comprehensive description of the entity\'s attributes and activities\nFormat each entity as ("entity"<|><entity_name><|><entity_type><|><entity_description>)\n \n2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are clearly related to each other.\nFor each pair of related entities, extract the following information:\n- source_entity: name of the source entity, as identified in step 1\n- target_entity: name of the target entity, as identified in step 1\n- relationship_description: explanation as to why you think the source entity and the target entity are related to each other\n- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity\n Format each relationship as ("relationship"<|><source_entity><|><target_entity><|><relationship_description><|><relationship_strength>)\n \n3. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use ## as the list delimiter.\n \n4. When finished, output <|COMPLETE|>\n \n######################\n-Examples-\n######################\nExample 1:\nEntity_types: ORGANIZATION,PERSON\nText:\nThe Verdantis\'s Central Institution is scheduled to meet on Monday and Thursday, with the institution planning to release its latest policy decision on Thursday at 1:30 p.m. PDT, followed by a press conference where Central Institution Chair Martin Smith will take questions. 
Investors expect the Market Strategy Committee to hold its benchmark interest rate steady in a range of 3.5%-3.75%.\n######################\nOutput:\n("entity"<|>CENTRAL INSTITUTION<|>ORGANIZATION<|>The Central Institution is the Federal Reserve of Verdantis, which is setting interest rates on Monday and Thursday)\n##\n("entity"<|>MARTIN SMITH<|>PERSON<|>Martin Smith is the chair of the Central Institution)\n##\n("entity"<|>MARKET STRATEGY COMMITTEE<|>ORGANIZATION<|>The Central Institution committee makes key decisions about interest rates and the growth of Verdantis\'s money supply)\n##\n("relationship"<|>MARTIN SMITH<|>CENTRAL INSTITUTION<|>Martin Smith is the Chair of the Central Institution and will answer questions at a press conference<|>9)\n<|COMPLETE|>\n\n######################\nExample 2:\nEntity_types: ORGANIZATION\nText:\nTechGlobal\'s (TG) stock skyrocketed in its opening day on the Global Exchange Thursday. But IPO experts warn that the semiconductor corporation\'s debut on the public markets isn\'t indicative of how other newly listed companies may perform.\n\nTechGlobal, a formerly public company, was taken private by Vision Holdings in 2014. The well-established chip designer says it powers 85% of premium smartphones.\n######################\nOutput:\n("entity"<|>TECHGLOBAL<|>ORGANIZATION<|>TechGlobal is a stock now listed on the Global Exchange which powers 85% of premium smartphones)\n##\n("entity"<|>VISION HOLDINGS<|>ORGANIZATION<|>Vision Holdings is a firm that previously owned TechGlobal)\n##\n("relationship"<|>TECHGLOBAL<|>VISION HOLDINGS<|>Vision Holdings formerly owned TechGlobal from 2014 until present<|>5)\n<|COMPLETE|>\n\n######################\nExample 3:\nEntity_types: ORGANIZATION,GEO,PERSON\nText:\nFive Aurelians jailed for 8 years in Firuzabad and widely regarded as hostages are on their way home to Aurelia.\n\nThe swap orchestrated by Quintara was finalized when $8bn of Firuzi funds were transferred to financial institutions in Krohaara, the capital of Quintara.\n\nThe exchange initiated in Firuzabad\'s capital, Tiruzia, led to the four men and one woman, who are also Firuzi nationals, boarding a chartered flight to Krohaara.\n\nThey were welcomed by senior Aurelian officials and are now on their way to Aurelia\'s capital, Cashion.\n\nThe Aurelians include 39-year-old businessman Samuel Namara, who has been held in Tiruzia\'s Alhamia Prison, as well as journalist Durke Bataglani, 59, and environmentalist Meggie Tazbah, 53, who also holds Bratinas nationality.\n######################\nOutput:\n("entity"<|>FIRUZABAD<|>GEO<|>Firuzabad held Aurelians as hostages)\n##\n("entity"<|>AURELIA<|>GEO<|>Country seeking to release hostages)\n##\n("entity"<|>QUINTARA<|>GEO<|>Country that negotiated a swap of money in exchange for hostages)\n##\n##\n("entity"<|>TIRUZIA<|>GEO<|>Capital of Firuzabad where the Aurelians were being held)\n##\n("entity"<|>KROHAARA<|>GEO<|>Capital city in Quintara)\n##\n("entity"<|>CASHION<|>GEO<|>Capital city in Aurelia)\n##\n("entity"<|>SAMUEL NAMARA<|>PERSON<|>Aurelian who spent time in Tiruzia\'s Alhamia Prison)\n##\n("entity"<|>ALHAMIA PRISON<|>GEO<|>Prison in Tiruzia)\n##\n("entity"<|>DURKE BATAGLANI<|>PERSON<|>Aurelian journalist who was held hostage)\n##\n("entity"<|>MEGGIE TAZBAH<|>PERSON<|>Bratinas national and environmentalist who was held hostage)\n##\n("relationship"<|>FIRUZABAD<|>AURELIA<|>Firuzabad negotiated a hostage exchange with Aurelia<|>2)\n##\n("relationship"<|>QUINTARA<|>AURELIA<|>Quintara brokered the hostage exchange 
between Firuzabad and Aurelia<|>2)\n##\n("relationship"<|>QUINTARA<|>FIRUZABAD<|>Quintara brokered the hostage exchange between Firuzabad and Aurelia<|>2)\n##\n("relationship"<|>SAMUEL NAMARA<|>ALHAMIA PRISON<|>Samuel Namara was a prisoner at Alhamia prison<|>8)\n##\n("relationship"<|>SAMUEL NAMARA<|>MEGGIE TAZBAH<|>Samuel Namara and Meggie Tazbah were exchanged in the same hostage release<|>2)\n##\n("relationship"<|>SAMUEL NAMARA<|>DURKE BATAGLANI<|>Samuel Namara and Durke Bataglani were exchanged in the same hostage release<|>2)\n##\n("relationship"<|>MEGGIE TAZBAH<|>DURKE BATAGLANI<|>Meggie Tazbah and Durke Bataglani were exchanged in the same hostage release<|>2)\n##\n("relationship"<|>SAMUEL NAMARA<|>FIRUZABAD<|>Samuel Namara was a hostage in Firuzabad<|>2)\n##\n("relationship"<|>MEGGIE TAZBAH<|>FIRUZABAD<|>Meggie Tazbah was a hostage in Firuzabad<|>2)\n##\n("relationship"<|>DURKE BATAGLANI<|>FIRUZABAD<|>Durke Bataglani was a hostage in Firuzabad<|>2)\n<|COMPLETE|>\n\n######################\n-Real Data-\n######################\nEntity_types: organization,person,geo,event\nText: .\n Mr. Fezziwig, a kind-hearted, jovial old merchant.\n Fred, Scrooge\'s nephew.\n Ghost of Christmas Past, a phantom showing things past.\n Ghost of Christmas Present, a spirit of a kind, generous,\n and hearty nature.\n Ghost of Christmas Yet to Come, an apparition showing the shadows\n of things which yet may happen.\n Ghost of Jacob Marley, a spectre of Scrooge\'s former partner in business.\n Joe, a marine-store dealer and receiver of stolen goods.\n Ebenezer Scrooge, a grasping, covetous old man, the surviving partner\n of the firm of Scrooge and Marley.\n Mr. Topper, a bachelor.\n Dick Wilkins, a fellow apprentice of Scrooge\'s.\n\n Belle, a comely matron, an old sweetheart of Scrooge\'s.\n Caroline, wife of one of Scrooge\'s debtors.\n Mrs. Cratchit, wife of Bob Cratchit.\n Belinda and Martha Cratchit, daughters of the preceding.\n\n Mrs. Dilber, a laundress.\n Fan, the sister of Scrooge.\n Mrs. Fezziwig, the worthy partner of Mr. Fezziwig.\n\n\n\n\n CONTENTS\n\n STAVE ONE--MARLEY\'S GHOST 3\n STAVE TWO--THE FIRST OF THE THREE SPI\n######################\nOutput:'}

> Please inspect the indexing-engine.log. Often this error is preceded by errors earlier in the pipeline, usually due to OpenAI key issues such as permissions or missing config.

natoverse commented 3 weeks ago

It looks like there are connection issues with the OpenAI library:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/graphrag/index/graph/extractors/graph/graph_extractor.py", line 123, in __call__
    result = await self._process_document(text, prompt_variables)
  File "/usr/local/lib/python3.10/site-packages/graphrag/index/graph/extractors/graph/graph_extractor.py", line 151, in _process_document
    response = await self._llm(
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/openai/json_parsing_llm.py", line 34, in __call__
    result = await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/openai/openai_token_replacing_llm.py", line 37, in __call__
    return await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/openai/openai_history_tracking_llm.py", line 33, in __call__
    output = await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/caching_llm.py", line 96, in __call__
    result = await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 177, in __call__
    result, start = await execute_with_retry()
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 159, in execute_with_retry
    async for attempt in retryer:
  File "/usr/local/lib/python3.10/site-packages/tenacity/asyncio/__init__.py", line 166, in __anext__
    do = await self.iter(retry_state=self._retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/asyncio/__init__.py", line 153, in iter
    result = await action(retry_state)
  File "/usr/local/lib/python3.10/site-packages/tenacity/_utils.py", line 99, in inner
    return call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 418, in exc_check
    raise retry_exc.reraise()
  File "/usr/local/lib/python3.10/site-packages/tenacity/__init__.py", line 185, in reraise
    raise self.last_attempt.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 165, in execute_with_retry
    return await do_attempt(), start
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/rate_limiting_llm.py", line 147, in do_attempt
    return await self._delegate(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/base_llm.py", line 49, in __call__
    return await self._invoke(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/base/base_llm.py", line 53, in _invoke
    output = await self._execute_llm(input, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/graphrag/llm/openai/openai_chat_llm.py", line 53, in _execute_llm
    completion = await self.client.chat.completions.create(
  File "/usr/local/lib/python3.10/site-packages/openai/resources/chat/completions.py", line 1339, in create
    return await self._post(
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1816, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1510, in request
    return await self._request(
  File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1583, in _request
    raise APIConnectionError(request=request) from err
openai.APIConnectionError: Connection error.

I don't see any other reporting that would help diagnose the underlying issue (e.g., internet connectivity, key validity), so the best suggestion at the moment is to try connecting to the OpenAI API directly and in the playground to confirm that your setup with them is valid.
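
A quick way to verify connectivity outside graphrag is a one-off request with the OpenAI Python client. A minimal sketch (assumes the openai v1 package and the same GRAPHRAG_API_KEY environment variable referenced in settings.yaml):

import os
from openai import OpenAI

# Use the same key graphrag resolves from ${GRAPHRAG_API_KEY}.
client = OpenAI(api_key=os.environ["GRAPHRAG_API_KEY"])

# If this raises openai.APIConnectionError, the problem is network-, proxy-,
# or key-related rather than anything graphrag-specific.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)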

Kanishk-T commented 3 weeks ago

Facing the same issue currently. Have validated my key and internet connection by connecting to the API directly. Logs:

12:33:15,301 graphrag.config.read_dotenv INFO Loading pipeline .env file
12:33:15,304 graphrag.index.cli INFO using default configuration: {
    "llm": {
        "api_key": "REDACTED, length 56",
        "type": "openai_chat",
        "model": "gpt-4o-mini",
        "max_tokens": 4000,
        "temperature": 0.0,
        "top_p": 1.0,
        "n": 1,
        "request_timeout": 180.0,
        "api_base": null,
        "api_version": null,
        "proxy": null,
        "cognitive_services_endpoint": null,
        "deployment_name": null,
        "model_supports_json": true,
        "tokens_per_minute": 0,
        "requests_per_minute": 0,
        "max_retries": 10,
        "max_retry_wait": 10.0,
        "sleep_on_rate_limit_recommendation": true,
        "concurrent_requests": 25
    },
    "parallelization": {
        "stagger": 0.3,
        "num_threads": 50
    },
    "async_mode": "threaded",
    "root_dir": ".",
    "reporting": {
        "type": "file",
        "base_dir": "output/${timestamp}/reports",
        "storage_account_blob_url": null
    },
    "storage": {
        "type": "file",
        "base_dir": "output/${timestamp}/artifacts",
        "storage_account_blob_url": null
    },
    "cache": {
        "type": "file",
        "base_dir": "cache",
        "storage_account_blob_url": null
    },
    "input": {
        "type": "file",
        "file_type": "text",
        "base_dir": "input",
        "storage_account_blob_url": null,
        "encoding": "utf-8",
        "file_pattern": ".*\\.txt$",
        "file_filter": null,
        "source_column": null,
        "timestamp_column": null,
        "timestamp_format": null,
        "text_column": "text",
        "title_column": null,
        "document_attribute_columns": []
    },
    "embed_graph": {
        "enabled": false,
        "num_walks": 10,
        "walk_length": 40,
        "window_size": 2,
        "iterations": 3,
        "random_seed": 597832,
        "strategy": null
    },
    "embeddings": {
        "llm": {
            "api_key": "REDACTED, length 56",
            "type": "openai_embedding",
            "model": "text-embedding-3-small",
            "max_tokens": 4000,
            "temperature": 0,
            "top_p": 1,
            "n": 1,
            "request_timeout": 180.0,
            "api_base": null,
            "api_version": null,
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": null,
            "tokens_per_minute": 0,
            "requests_per_minute": 0,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "batch_size": 16,
        "batch_max_tokens": 8191,
        "target": "required",
        "skip": [],
        "vector_store": null,
        "strategy": null
    },
    "chunks": {
        "size": 1200,
        "overlap": 100,
        "group_by_columns": [
            "id"
        ],
        "strategy": null
    },
    "snapshots": {
        "graphml": false,
        "raw_entities": false,
        "top_level_nodes": false
    },
    "entity_extraction": {
        "llm": {
            "api_key": "REDACTED, length 56",
            "type": "openai_chat",
            "model": "gpt-4o-mini",
            "max_tokens": 4000,
            "temperature": 0.0,
            "top_p": 1.0,
            "n": 1,
            "request_timeout": 180.0,
            "api_base": null,
            "api_version": null,
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": true,
            "tokens_per_minute": 0,
            "requests_per_minute": 0,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "prompt": "prompts/entity_extraction.txt",
        "entity_types": [
            "organization",
            "person",
            "geo",
            "event"
        ],
        "max_gleanings": 1,
        "strategy": null
    },
    "summarize_descriptions": {
        "llm": {
            "api_key": "REDACTED, length 56",
            "type": "openai_chat",
            "model": "gpt-4o-mini",
            "max_tokens": 4000,
            "temperature": 0.0,
            "top_p": 1.0,
            "n": 1,
            "request_timeout": 180.0,
            "api_base": null,
            "api_version": null,
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": true,
            "tokens_per_minute": 0,
            "requests_per_minute": 0,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "prompt": "prompts/summarize_descriptions.txt",
        "max_length": 500,
        "strategy": null
    },
    "community_reports": {
        "llm": {
            "api_key": "REDACTED, length 56",
            "type": "openai_chat",
            "model": "gpt-4o-mini",
            "max_tokens": 4000,
            "temperature": 0.0,
            "top_p": 1.0,
            "n": 1,
            "request_timeout": 180.0,
            "api_base": null,
            "api_version": null,
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": true,
            "tokens_per_minute": 0,
            "requests_per_minute": 0,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "prompt": "prompts/community_report.txt",
        "max_length": 2000,
        "max_input_length": 8000,
        "strategy": null
    },
    "claim_extraction": {
        "llm": {
            "api_key": "REDACTED, length 56",
            "type": "openai_chat",
            "model": "gpt-4o-mini",
            "max_tokens": 4000,
            "temperature": 0.0,
            "top_p": 1.0,
            "n": 1,
            "request_timeout": 180.0,
            "api_base": null,
            "api_version": null,
            "proxy": null,
            "cognitive_services_endpoint": null,
            "deployment_name": null,
            "model_supports_json": true,
            "tokens_per_minute": 0,
            "requests_per_minute": 0,
            "max_retries": 10,
            "max_retry_wait": 10.0,
            "sleep_on_rate_limit_recommendation": true,
            "concurrent_requests": 25
        },
        "parallelization": {
            "stagger": 0.3,
            "num_threads": 50
        },
        "async_mode": "threaded",
        "enabled": false,
        "prompt": "prompts/claim_extraction.txt",
        "description": "Any claims or facts that could be relevant to information discovery.",
        "max_gleanings": 1,
        "strategy": null
    },
    "cluster_graph": {
        "max_cluster_size": 10,
        "strategy": null
    },
    "umap": {
        "enabled": false
    },
    "local_search": {
        "text_unit_prop": 0.5,
        "community_prop": 0.1,
        "conversation_history_max_turns": 5,
        "top_k_entities": 10,
        "top_k_relationships": 10,
        "temperature": 0.0,
        "top_p": 1.0,
        "n": 1,
        "max_tokens": 12000,
        "llm_max_tokens": 2000
    },
    "global_search": {
        "temperature": 0.0,
        "top_p": 1.0,
        "n": 1,
        "max_tokens": 12000,
        "data_max_tokens": 12000,
        "map_max_tokens": 1000,
        "reduce_max_tokens": 2000,
        "concurrency": 32
    },
    "encoding_model": "cl100k_base",
    "skip_workflows": []
}
12:33:15,306 graphrag.index.create_pipeline_config INFO skipping workflows 
12:33:15,318 graphrag.index.run INFO Running pipeline
12:33:15,318 graphrag.index.storage.file_pipeline_storage INFO Creating file storage at output/20240822-123315/artifacts
12:33:15,319 graphrag.index.input.load_input INFO loading input from root_dir=input
12:33:15,319 graphrag.index.input.load_input INFO using file storage for input
12:33:15,319 graphrag.index.storage.file_pipeline_storage INFO search input for files matching .*\.txt$
12:33:15,320 graphrag.index.input.text INFO found text files from input, found [('.txt', {})]
12:33:15,321 graphrag.index.input.text INFO Found 1 files, loading 1
12:33:15,322 graphrag.index.workflows.load INFO Workflow Run Order: ['create_base_text_units', 'create_base_extracted_entities', 'create_summarized_entities', 'create_base_entity_graph', 'create_final_entities', 'create_final_nodes', 'create_final_communities', 'join_text_units_to_entity_ids', 'create_final_relationships', 'join_text_units_to_relationship_ids', 'create_final_community_reports', 'create_final_text_units', 'create_base_documents', 'create_final_documents']
12:33:15,322 graphrag.index.run INFO Final # of rows loaded: 1
12:33:15,432 graphrag.index.run INFO Running workflow: create_base_text_units...
12:33:15,432 graphrag.index.run INFO dependencies for create_base_text_units: []
12:33:15,435 datashaper.workflow.workflow INFO executing verb orderby
12:33:15,437 datashaper.workflow.workflow INFO executing verb zip
12:33:15,439 datashaper.workflow.workflow INFO executing verb aggregate_override
12:33:15,444 datashaper.workflow.workflow INFO executing verb chunk
12:33:15,605 datashaper.workflow.workflow INFO executing verb select
12:33:15,607 datashaper.workflow.workflow INFO executing verb unroll
12:33:15,611 datashaper.workflow.workflow INFO executing verb rename
12:33:15,615 datashaper.workflow.workflow INFO executing verb genid
12:33:15,619 datashaper.workflow.workflow INFO executing verb unzip
12:33:15,623 datashaper.workflow.workflow INFO executing verb copy
12:33:15,626 datashaper.workflow.workflow INFO executing verb filter
12:33:15,635 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_base_text_units.parquet
12:33:15,763 graphrag.index.run INFO Running workflow: create_base_extracted_entities...
12:33:15,764 graphrag.index.run INFO dependencies for create_base_extracted_entities: ['create_base_text_units']
12:33:15,764 graphrag.index.run INFO read table from storage: create_base_text_units.parquet
12:33:15,774 datashaper.workflow.workflow INFO executing verb entity_extract
12:33:15,776 graphrag.llm.openai.create_openai_client INFO Creating OpenAI client base_url=None
12:33:15,808 graphrag.index.llm.load_llm INFO create TPM/RPM limiter for gpt-4o-mini: TPM=0, RPM=0
12:33:15,808 graphrag.index.llm.load_llm INFO create concurrency limiter for gpt-4o-mini: 25
12:33:16,983 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
12:33:16,988 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "Process" with 0 retries took 1.1759999999776483. input_tokens=1935, output_tokens=5
12:33:17,641 httpx INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
12:33:17,643 graphrag.llm.base.rate_limiting_llm INFO perf - llm.chat "extract-continuation-0" with 0 retries took 0.651999999769032. input_tokens=19, output_tokens=19
12:33:17,659 datashaper.workflow.workflow INFO executing verb merge_graphs
12:33:17,663 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_base_extracted_entities.parquet
12:33:17,787 graphrag.index.run INFO Running workflow: create_summarized_entities...
12:33:17,787 graphrag.index.run INFO dependencies for create_summarized_entities: ['create_base_extracted_entities']
12:33:17,788 graphrag.index.run INFO read table from storage: create_base_extracted_entities.parquet
12:33:17,798 datashaper.workflow.workflow INFO executing verb summarize_descriptions
12:33:17,800 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_summarized_entities.parquet
12:33:17,926 graphrag.index.run INFO Running workflow: create_base_entity_graph...
12:33:17,926 graphrag.index.run INFO dependencies for create_base_entity_graph: ['create_summarized_entities']
12:33:17,926 graphrag.index.run INFO read table from storage: create_summarized_entities.parquet
12:33:17,937 datashaper.workflow.workflow INFO executing verb cluster_graph
12:33:17,937 graphrag.index.verbs.graph.clustering.cluster_graph WARNING Graph has no nodes
12:33:17,940 datashaper.workflow.workflow ERROR Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key
Traceback (most recent call last):
  File "/opt/anaconda3/envs/myenv/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kanishktyagi/Kanishk_POC/graphrag/graphrag/graphrag/index/verbs/graph/clustering/cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
    ~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/myenv/lib/python3.12/site-packages/pandas/core/frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "/opt/anaconda3/envs/myenv/lib/python3.12/site-packages/pandas/core/frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/opt/anaconda3/envs/myenv/lib/python3.12/site-packages/pandas/core/indexers/utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
12:33:17,946 graphrag.index.reporting.file_workflow_callbacks INFO Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key details=None
12:33:17,946 graphrag.index.run ERROR error running workflow create_base_entity_graph
Traceback (most recent call last):
  File "/Users/kanishktyagi/Kanishk_POC/graphrag/graphrag/graphrag/index/run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/myenv/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/myenv/lib/python3.12/site-packages/datashaper/workflow/workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kanishktyagi/Kanishk_POC/graphrag/graphrag/graphrag/index/verbs/graph/clustering/cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
    ~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/myenv/lib/python3.12/site-packages/pandas/core/frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "/opt/anaconda3/envs/myenv/lib/python3.12/site-packages/pandas/core/frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/opt/anaconda3/envs/myenv/lib/python3.12/site-packages/pandas/core/indexers/utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
12:33:17,947 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None

jgbradley1 commented 3 weeks ago

@Kanishk-T I see the following line in your log.

12:33:17,937 graphrag.index.verbs.graph.clustering.cluster_graph WARNING Graph has no nodes

This means that graphrag was not able to extract any entities and/or relationships from the data you are trying to index. We could improve the error handling here as this is a known edge case.
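
The pandas mechanics behind the message are easy to reproduce: when the clustering step yields no (level, graph) pairs, the right-hand side of the output_df[[level_to, to]] assignment in cluster_graph.py is a DataFrame with zero columns, which pandas refuses to assign to a two-column key. A minimal sketch of the failing pattern (column names are illustrative, not the exact ones in cluster_graph.py):

import pandas as pd

# One row per input graph; clustering found no communities, so the list is empty.
output_df = pd.DataFrame({"clustered_graph": [[]]})

# Zero-column frame on the right, two column names on the left -> ValueError.
output_df[["level", "clustered_graph_new"]] = pd.DataFrame(
    output_df["clustered_graph"].tolist(), index=output_df.index
)
# ValueError: Columns must be same length as key

So the ValueError is a downstream symptom; the "Graph has no nodes" warning above it is the actual problem to chase.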

allseeworld commented 2 weeks ago

I also encountered the same issue, which happened after I upgraded to version 0.3.1. There were significant changes in this version, and many settings now need to be configured in the environment.

9prodhi commented 2 weeks ago

I've also encountered this issue after upgrading GraphRAG to the latest version. The problem appears to be due to significant changes in the prompts.

To resolve this, I replaced the new prompts with the older version that was previously working for me.

Kanishk-T commented 2 weeks ago

@jgbradley1 I've tried the example from the documentation in a fresh project, and it hits this exact same error. As others in this thread have mentioned, this started after the update; it isn't caused by any particular type of input data or by using models other than OpenAI, but by the changes made in the last update. I can try to pinpoint the cause, since I hit this same error during the prompt-tuning process and have made a PR for it: https://github.com/microsoft/graphrag/pull/925

The same thing happened there: because of the way the prompt generated the examples, the graph came out empty, just as it does now.

AlonsoGuevara commented 2 weeks ago

@Kanishk-T, @9prodhi, @allseeworld I've merged #925

I haven't cut a new release to PyPI yet, but can you please run from source to check whether this solves the issue you're facing? You'll need to rerun prompt tuning for this change to take effect.
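
For anyone unsure how to do that, something like the following should work (commands are illustrative; adjust the project root to match yours):

git clone https://github.com/microsoft/graphrag.git
cd graphrag
pip install -e .  # or: poetry install

# Regenerate the tuned prompts so they pick up the fix, then re-index:
python -m graphrag.prompt_tune --root ./ragtest
python -m graphrag.index --root ./ragtest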

Kanishk-T commented 1 week ago

Hey @AlonsoGuevara, I've made the exact same changes to the default prompts and tested the changes with prompt tuning. I ran it on a default project, and simply removing the asterisks (**) from either side of {record_delimiter} appears to have fixed the empty-graph error in the indexing pipeline.
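
For context, the offending pattern in the generated templates looked roughly like this (illustrative, not the verbatim template text):

before: ...<entity_description>)**{record_delimiter}**
after:  ...<entity_description>){record_delimiter}

With literal asterisks around the delimiter in every example, the model can echo them in its output, in which case the extractor parses no records and the graph comes back empty.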

pimooook commented 1 week ago

What's your context window size? I had the same issue with Ollama and qwen2, but found that the default num_ctx=2048 is too small to produce the right response. After I set num_ctx=32000, it works.
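
For other Ollama users: num_ctx is the model's context window, which is separate from graphrag's llm.max_tokens (a cap on completion length). One way to raise it is a derived model built from a Modelfile (model names here are illustrative):

# Modelfile
FROM qwen2
PARAMETER num_ctx 32000

# Build it and point settings.yaml's llm.model at the new name:
#   ollama create qwen2-32k -f Modelfile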

9prodhi commented 1 week ago

Even after modifying the prompt by removing the * character, the Mistral model is still not functioning as expected. Specifically, the model is failing to extract any edges for the generated graph.

Reproduction Steps

  1. Updated the prompt by removing the * character.
  2. Ran the init with the updated prompt.
  3. Ran the indexing pipeline.

Additional Information

Data Link

Transformer

Summarized Graph Visualization


Settings File

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat
  model: gemma2 # mistral gemma2
  model_supports_json: true
  # api_base: http://host.docker.internal:11434/v1
  api_base: http://localhost:11434/v1
  # api_base: http://127.0.0.1:7002/v1
  concurrent_requests: 24

parallelization:
  stagger: 120

async_mode: threaded

embeddings:
  async_mode: threaded
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding
    model: nomic-ai/nomic-embed-text-v1.5-GGUF
    # api_base: http://localhost:8001/v1/
    # api_base: http://44.200.78.20/v1/
    api_base: http://localhost:8001/v1
    concurrent_requests: 2

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id]

input:
  type: file
  file_type: text
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file
  base_dir: "cache"

storage:
  type: file
  base_dir: "output/${timestamp}/artifacts"

reporting:
  type: file
  base_dir: "output/${timestamp}/reports"

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event, Paper, Journal, Conference, Citation, Research Topic]
  max_gleanings: 0

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false

umap:
  enabled: false

snapshots:
  graphml: true
  raw_entities: true
  top_level_nodes: false

Error log

hierarchical_clusters_native = gn.hierarchical_leiden(
                                   ^^^^^^^^^^^^^^^^^^^^^^^
leiden.EmptyNetworkError: EmptyNetworkError
09:15:29,615 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None
09:15:29,623 graphrag.index.cli ERROR Errors occurred during the pipeline run, see logs for more details.
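
The EmptyNetworkError is the same root cause surfacing one step earlier: graspologic's Leiden binding rejects a graph with no edges (here, a fully empty one). A minimal sketch of the failing call (assumes networkx and graspologic are installed):

import networkx as nx
from graspologic.partition import hierarchical_leiden

# With no entities/relationships extracted upstream, the entity graph is
# empty and the native Leiden implementation raises EmptyNetworkError.
hierarchical_leiden(nx.Graph(), max_cluster_size=10)

So, as above, the thing to fix is the entity extraction step producing no usable records for the model in use.
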
simonjoe246 commented 1 week ago

> What's your context window size? I had the same issue with Ollama and qwen2, but found that the default num_ctx=2048 is too small to produce the right response. After I set num_ctx=32000, it works.

Does the num_ctx param correspond to llm.max_tokens?

jackiezhangcn commented 1 day ago

Any solutions to this issue? I have the same issue while running Ollama (llama3.1).