microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Bug] "ValueError: Columns must be same length as key" - Entity extraction fails due to invalid format returned by API #443

Closed dvdtoth closed 4 months ago

dvdtoth commented 4 months ago

Model used: GPT-4o, openai api, JSON mode. Default settings, default prompts, OAI Rate limits configured according to tier.

The pipeline fails with the error message "ValueError: Columns must be same length as key". It seems the cluster_graph() function receives an empty create_base_extracted_entities.parquet file with zero nodes (indicated at 16:40:36,22 in logs below).

Taking a closer look, the entity extraction seems to fail due to invalid formatting. When the run fails, the cache folder still holds the entity_extraction responses. On closer inspection, the problem is in the prefix of the "result" object returned by the API.

Expected format: {"result": "(\"entity\"<|>\" ....}

Failing formats sampled from different runs: {"result": "**Entities:**\n\n(\"entity\"<|>\" ....} {"result": "##(\"entity\"<|>\" ....}

After removing the unnecessary prefix in all cached files the run can successfully index the document.
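That cleanup can be scripted. A hedged sketch follows; the cache path, the chat_* filename pattern, and the {"result": ...} JSON layout are assumptions taken from this thread, so verify them against your own cache before running it:

```python
import json
from pathlib import Path

CACHE_DIR = Path("cache/entity_extraction")  # adjust to your project layout


def strip_prefix(result: str, marker: str = '("entity"') -> str:
    """Drop anything the model emitted before the first entity record."""
    idx = result.find(marker)
    return result[idx:] if idx > 0 else result


if CACHE_DIR.is_dir():
    for path in CACHE_DIR.glob("chat_*"):
        data = json.loads(path.read_text())
        cleaned = strip_prefix(data.get("result", ""))
        if cleaned != data.get("result"):
            data["result"] = cleaned
            path.write_text(json.dumps(data))
```

Back up the cache folder first; if a response is missing the marker entirely, the sketch leaves that file untouched.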

Constraining the response using logit bias / pydantic would help, and validating response samples would catch this earlier. The pipeline (and cluster_graph()) should fail gracefully when no nodes are extracted. This is a frustrating error during an expensive run on a large corpus. Looking forward to digging deeper.

Logs:

16:40:36,13 graphrag.index.run INFO Running workflow: create_base_entity_graph...
16:40:36,14 graphrag.index.run INFO dependencies for create_base_entity_graph: ['create_summarized_entities']
16:40:36,14 graphrag.index.run INFO read table from storage: create_summarized_entities.parquet
16:40:36,22 datashaper.workflow.workflow INFO executing verb cluster_graph
16:40:36,22 graphrag.index.verbs.graph.clustering.cluster_graph WARNING Graph has no nodes
16:40:36,25 datashaper.workflow.workflow ERROR Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key
Traceback (most recent call last):
File "/Library/Caches/pypoetry/virtualenvs/graphrag-C7j4Bq5Y-py3.11/lib/python3.11/site-packages/datashaper/workflow/workflow.py", line 410, in _execute_verb
result = node.verb.func(**verb_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/graphrag/graphrag/index/verbs/graph/clustering/cluster_graph.py", line 105, in cluster_graph
output_df[[level_to, to]] = pd.DataFrame(
File "/Library/Caches/pypoetry/virtualenvs/graphrag-C7j4Bq5Y-py3.11/lib/python3.11/site-packages/pandas/core/frame.py", line 4299, in __setitem__
self._setitem_array(key, value)
File "/Library/Caches/pypoetry/virtualenvs/graphrag-C7j4Bq5Y-py3.11/lib/python3.11/site-packages/pandas/core/frame.py", line 4341, in _setitem_array
check_key_length(self.columns, key, value)
File "/Library/Caches/pypoetry/virtualenvs/graphrag-C7j4Bq5Y-py3.11/lib/python3.11/site-packages/pandas/core/indexers/utils.py", line 390, in check_key_length
raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
16:40:36,27 graphrag.index.reporting.file_workflow_callbacks INFO Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key details=None
16:40:36,27 graphrag.index.run ERROR error running workflow create_base_entity_graph
Traceback (most recent call last):
File "/workspace//graphrag/graphrag/index/run.py", line 323, in run_pipeline
result = await workflow.run(context, callbacks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Caches/pypoetry/virtualenvs/graphrag-C7j4Bq5Y-py3.11/lib/python3.11/site-packages/datashaper/workflow/workflow.py", line 369, in run
timing = await self._execute_verb(node, context, callbacks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Caches/pypoetry/virtualenvs/graphrag-C7j4Bq5Y-py3.11/lib/python3.11/site-packages/datashaper/workflow/workflow.py", line 410, in _execute_verb
result = node.verb.func(**verb_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/graphrag/graphrag/index/verbs/graph/clustering/cluster_graph.py", line 105, in cluster_graph
output_df[[level_to, to]] = pd.DataFrame(
~~~~~~~~~^^^^^^^^^^^^^^^^
File "/Library/Caches/pypoetry/virtualenvs/graphrag-C7j4Bq5Y-py3.11/lib/python3.11/site-packages/pandas/core/frame.py", line 4299, in __setitem__
self._setitem_array(key, value)
File "/Library/Caches/pypoetry/virtualenvs/graphrag-C7j4Bq5Y-py3.11/lib/python3.11/site-packages/pandas/core/frame.py", line 4341, in _setitem_array
check_key_length(self.columns, key, value)
File "/Library/Caches/pypoetry/virtualenvs/graphrag-C7j4Bq5Y-py3.11/lib/python3.11/site-packages/pandas/core/indexers/utils.py", line 390, in check_key_length
raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
16:40:36,27 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None
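The failure mode is easy to reproduce in isolation: when the graph has no nodes, cluster_graph builds an empty (zero-column) DataFrame and assigns it to two column keys, which trips pandas' key-length check. A minimal sketch (column names are illustrative, not GraphRAG's actual ones):

```python
import pandas as pd

# Zero entities extracted -> the clustering column is an empty list.
df = pd.DataFrame({"communities": []})

try:
    # Expanding an empty list column yields a 0-column DataFrame;
    # assigning it to two keys raises the error seen in the logs.
    df[["level", "cluster"]] = pd.DataFrame(df["communities"].tolist(), index=df.index)
except ValueError as e:
    print(e)  # -> Columns must be same length as key
```

This is why the "Graph has no nodes" warning immediately precedes the ValueError: the empty graph is the root cause, and the column assignment is just where it surfaces.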

Possibly related issues: https://github.com/microsoft/graphrag/issues/437 https://github.com/microsoft/graphrag/issues/414 https://github.com/microsoft/graphrag/issues/426 https://github.com/microsoft/graphrag/issues/441

dvdtoth commented 4 months ago

For those who are stuck after the create_base_text_units step with different error messages:

• Open your project's cache/entity_extraction folder.
• Take a look at any chat_XXXX file.
• Your files should start with {"result": "(\"entity\" but might have formatting like {"result": "**Entities**\n\n(\"entity\".
• Remove the bogus prefix with a string replace on all cached files.
• Re-run in the same project folder and it might succeed.

ArneJanning commented 4 months ago

Thank you, @dvdtoth , worked for me.

ShivamGupta42 commented 4 months ago

It is not working for me even after fixing every file. Every single chat* file starts with {"result": "(\"entity\"<|>, and yet I am unable to resume and am getting the following error:

❌ create_base_entity_graph
None
⠴ GraphRAG Indexer
├── Loading Input (text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
└── create_base_entity_graph
❌ Errors occurred during the pipeline run, see logs for more details.

with same error

"Error executing verb \"cluster_graph\" in create_base_entity_graph: Columns must be same length as key"

KylinMountain commented 4 months ago

the same here

eyast commented 4 months ago

@KylinMountain @ShivamGupta42 can you have a look at the indexing-engine.log file?

ShivamGupta42 commented 4 months ago

05:01:28,42 graphrag.index.create_pipeline_config INFO skipping workflows
05:01:28,79 graphrag.index.run INFO Running pipeline
05:01:28,79 graphrag.index.storage.file_pipeline_storage INFO Creating file storage at private/rag/kg_micro/tin/how_to_help/output/20240709-043643/artifacts
05:01:28,79 graphrag.index.input.load_input INFO loading input from root_dir=input
05:01:28,79 graphrag.index.input.load_input INFO using file storage for input
05:01:28,80 graphrag.index.storage.file_pipeline_storage INFO search private/rag/kg_micro/tin/how_to_help/input for files matching .*.txt$
05:01:28,80 graphrag.index.input.text INFO found text files from input, found [('hth_copy.txt', {})]
05:01:28,82 graphrag.index.workflows.load INFO Workflow Run Order: ['create_base_text_units', 'create_base_extracted_entities', 'create_summarized_entities', 'create_base_entity_graph', 'create_final_entities', 'create_final_nodes', 'create_final_communities', 'join_text_units_to_entity_ids', 'create_final_relationships', 'join_text_units_to_relationship_ids', 'create_final_community_reports', 'create_final_text_units', 'create_base_documents', 'create_final_documents']
05:01:28,82 graphrag.index.run INFO Final # of rows loaded: 1
05:01:28,164 graphrag.index.run INFO Running workflow: create_base_text_units...
05:01:28,164 graphrag.index.run INFO Skipping create_base_text_units because it already exists
05:01:28,236 graphrag.index.run INFO Running workflow: create_base_extracted_entities...
05:01:28,236 graphrag.index.run INFO Skipping create_base_extracted_entities because it already exists
05:01:28,308 graphrag.index.run INFO Running workflow: create_summarized_entities...
05:01:28,308 graphrag.index.run INFO Skipping create_summarized_entities because it already exists
05:01:28,380 graphrag.index.run INFO Running workflow: create_base_entity_graph...
05:01:28,380 graphrag.index.run INFO dependencies for create_base_entity_graph: ['create_summarized_entities']
05:01:28,380 graphrag.index.run INFO read table from storage: create_summarized_entities.parquet
05:01:28,389 datashaper.workflow.workflow INFO executing verb cluster_graph
05:01:28,389 graphrag.index.verbs.graph.clustering.cluster_graph WARNING Graph has no nodes
05:01:28,390 datashaper.workflow.workflow ERROR Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key
Traceback (most recent call last):
File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/datashaper/workflow/workflow.py", line 410, in _execute_verb
result = node.verb.func(**verb_args)
File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/graphrag/index/verbs/graph/clustering/cluster_graph.py", line 102, in cluster_graph
output_df[[level_to, to]] = pd.DataFrame(
File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/pandas/core/frame.py", line 4299, in __setitem__
self._setitem_array(key, value)
File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/pandas/core/frame.py", line 4341, in _setitem_array
check_key_length(self.columns, key, value)
File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/pandas/core/indexers/utils.py", line 390, in check_key_length
raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
05:01:28,393 graphrag.index.reporting.file_workflow_callbacks INFO Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key details=None
05:01:28,393 graphrag.index.run ERROR error running workflow create_base_entity_graph
Traceback (most recent call last):
File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/graphrag/index/run.py", line 323, in run_pipeline
result = await workflow.run(context, callbacks)
File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/datashaper/workflow/workflow.py", line 369, in run
timing = await self._execute_verb(node, context, callbacks)
File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/datashaper/workflow/workflow.py", line 410, in _execute_verb
result = node.verb.func(**verb_args)
File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/graphrag/index/verbs/graph/clustering/cluster_graph.py", line 102, in cluster_graph
output_df[[level_to, to]] = pd.DataFrame(
File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/pandas/core/frame.py", line 4299, in __setitem__
self._setitem_array(key, value)
File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/pandas/core/frame.py", line 4341, in _setitem_array
check_key_length(self.columns, key, value)
File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/pandas/core/indexers/utils.py", line 390, in check_key_length
raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
05:01:28,393 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None

This line in the log, "graphrag.index.verbs.graph.clustering.cluster_graph WARNING Graph has no nodes", looks suspicious, but I don't understand why this would happen.

eyast commented 4 months ago

I imagine this could happen if, for one reason or another, a previous step was unable to extract entities. This could be due to errors in the source text, errors caused by the prompt used to generate the entities, or errors accessing the LLM. To know exactly what happened in your case, consider inspecting the different files that graphrag has created (look in the artifacts folder). You can use a tool to open Parquet files (I use tad). Some of the parquet files contain a single row (e.g. create_base_extracted_entities.parquet) holding graphml content. You can copy this into a text editor and save it as a .graphml file to inspect the graph at that step with any tool you like (Gephi is very useful for visualizing graphml content). You might also want to look at the cache folder, which includes the LLM responses for the different queries sent. Each graphrag workflow has its own subfolder there.

xxWeiDG commented 4 months ago

the same here

ShivamGupta42 commented 4 months ago

@eyast create_base_extracted_entities.parquet and create_summarized_entities.parquet both contained empty structures:

<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
  <graph edgedefault="undirected" />
</graphml>

create_base_text_units.parquet did contain the complete chunk_id / document_id mapping etc. I am not sure what the next steps should be here. I am looking in the graphrag repo now to understand.
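An empty graph like the one above can be detected programmatically before the clustering step ever runs. A stdlib-only sketch that counts the node elements in a graphml blob (how you pull the blob out of the parquet, e.g. with pandas, is up to you):

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def graphml_node_count(graphml: str) -> int:
    """Count <node> elements anywhere inside a graphml document."""
    root = ET.fromstring(graphml)
    return len(root.findall(".//g:node", GRAPHML_NS))


# The empty structure from this thread has a <graph> but no <node> children.
empty = (
    '<graphml xmlns="http://graphml.graphdrawing.org/xmlns">'
    '<graph edgedefault="undirected" /></graphml>'
)
print(graphml_node_count(empty))  # -> 0
```

A count of zero at create_base_extracted_entities means the failure happened upstream (in entity extraction), not in cluster_graph itself.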

ShivamGupta42 commented 4 months ago

Hey folks, after spending time going through the code, I realised that I was getting rate-limiting errors in the logs and was wrong to assume that this is handled in the code. Because of those errors, the workflows before clustering were producing empty files, which caused the eventual error during the clustering step.

TL;DR: please increase your rate limits on OpenAI. This worked for me. :D

AlonsoGuevara commented 4 months ago

Hi @ShivamGupta42 Glad this worked!

My general rule of thumb when facing these issues is:

For rate limiting, you can try adjusting the requests_per_minute and tokens_per_minute settings.
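For reference, these knobs live under the llm section of settings.yaml. The values below are illustrative placeholders, not recommendations; set them below the actual limits of your OpenAI tier:

```yaml
llm:
  # Illustrative values only - match these to your account's tier.
  requests_per_minute: 500
  tokens_per_minute: 100000
  max_retries: 10
  concurrent_requests: 10
```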

shreyn07 commented 4 months ago

❌ create_final_community_reports
None
⠙ GraphRAG Indexer
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) 100%
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
├── create_final_entities
├── create_final_nodes
├── create_final_communities
├── join_text_units_to_entity_ids
├── create_final_relationships
├── join_text_units_to_relationship_ids
└── create_final_community_reports
❌ Errors occurred during the pipeline run, see logs for more details.

File "C:\Users\shrnema\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\indexes\range.py", line 417, in get_loc
    raise KeyError(key)
KeyError: 'community'
11:17:11,578 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None

shreyn07 commented 4 months ago

11:17:11,575 graphrag.index.run ERROR error running workflow create_final_community_reports
Traceback (most recent call last):
  File "C:\Users\shrnema\AppData\Local\Programs\Python\Python311\Lib\site-packages\graphrag\index\run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\shrnema\AppData\Local\Programs\Python\Python311\Lib\site-packages\datashaper\workflow\workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\shrnema\AppData\Local\Programs\Python\Python311\Lib\site-packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\shrnema\AppData\Local\Programs\Python\Python311\Lib\site-packages\datashaper\engine\verbs\window.py", line 73, in window
    window = __window_function_map[window_operation](
  File "C:\Users\shrnema\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\shrnema\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\indexes\range.py", line 417, in get_loc
    raise KeyError(key)
KeyError: 'community'
11:17:11,578 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None

dvdtoth commented 4 months ago

@AlonsoGuevara Not sure why this issue was closed. I had several runs with no rate limit errors and received unparseable responses. Is the entity extraction failure being tracked somewhere we can follow?

Fezaaan commented 4 months ago

Hey folks, after spending time going through the code. I realised that the I was getting rate limiting errors in the logs and I was wrong in assuming that this is handled in the code. And due to those errors, workflows before the clustering execution were producing empty files and hence eventual error during execution of clustering.

TLDR; Please increase your rate limits on openAI. This worked for me. :D

Hi @ShivamGupta42, do you happen to know what limits I should set if I use a gpt-35-turbo model from Azure OpenAI? I'm getting the error "EmptyNetwork" :(

dvdtoth commented 4 months ago

The ValueError part of this issue is now probably solved by the better error handling since https://github.com/microsoft/graphrag/pull/532

awaescher commented 4 months ago

For those who are stuck after the create_base_text_units step with different error messages.

  • Open your project's cache/entity_extraction folder.
  • Take a look at any chat_XXXX file.
  • Your files should start with {"result": "(\"entity\" but might have formatting like {"result": "**Entities**\n\n(\"entity\".
  • Remove the bogus prefix with a string replace on all cached files.
  • Re-run in the same project folder and it might succeed.

This is the most helpful comment I saw so far.

I used GraphRAG with Ollama and my cache files looked more like {"result": "1. Entities: so I guessed that my LLM might not be following the prompts closely enough to match the required format.

By changing my LLM to "mistral-nemo:latest" this error was gone 🎉 and the cache files now start with the expected format {"result": "(\"entity\"<|>

However, another one occurred at create_final_entities, which I am currently investigating ...

dvdtoth commented 4 months ago

Invalid formatting of the responses in the entity extraction step is still breaking the process.

In another scenario, the entity delimiter was returned as **##** instead of the expected ##. Again, a string replace across the entity_extraction folder allowed the process to resume.
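The same style of cleanup can be scripted. A hedged sketch follows; the cache path and the **##** artifact come from this thread, and other markdown wrappers your model emits may need their own replacement rules:

```python
from pathlib import Path

CACHE_DIR = Path("cache/entity_extraction")  # adjust to your project layout


def normalize_delimiters(text: str) -> str:
    """Unwrap a bold-wrapped record delimiter back to the plain form."""
    return text.replace("**##**", "##")


if CACHE_DIR.is_dir():
    for path in CACHE_DIR.glob("chat_*"):
        text = path.read_text()
        fixed = normalize_delimiters(text)
        if fixed != text:
            path.write_text(fixed)
```

As with the prefix fix, back up the cache folder before rewriting files in place.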

ljhskyso commented 4 months ago

I found that if the input file is named input.txt, it always throws this "Columns must be same length as key" error. Changing the filename to anything else resolved the issue. Not sure why that is, but I finally got unstuck after four hours!