[Closed] dvdtoth closed this issue 4 months ago.
For those who are stuck after the create_base_text_units step with different error messages:

- Open your project's cache/entity_extraction folder.
- Take a look at any chat_XXXX file.
- Your files should start with `{"result": "(\"entity\"` but might have formatting like `{"result": "**Entities**\n\n(\"entity\"`.
- Remove the bogus prefix with a string replace on all cached files.
- Re-run in the same project folder and it might succeed.
Thank you, @dvdtoth, worked for me.
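The "string replace on all cached files" step can be scripted. Below is a rough sketch (not from graphrag itself): it assumes the cache files are JSON objects with a `result` field, as in the examples in this thread, and that a `chat_*` glob matches them; adjust both to your setup.

```python
import json
import re
from pathlib import Path

# Anything before the first ("entity" tuple is treated as a bogus
# prefix added by the LLM (e.g. "**Entities**\n\n" or "##").
PREFIX_RE = re.compile(r'^.*?(?=\("entity")', flags=re.DOTALL)

def strip_bogus_prefix(raw: str) -> str:
    """Remove any prefix before the first ("entity" in a cached response."""
    data = json.loads(raw)
    data["result"] = PREFIX_RE.sub("", data.get("result", ""), count=1)
    return json.dumps(data)

def fix_cache(folder: str) -> None:
    """Rewrite every chat_* cache file in place."""
    for path in Path(folder).glob("chat_*"):
        path.write_text(strip_bogus_prefix(path.read_text()))

# fix_cache("cache/entity_extraction")
```

Files whose result already starts with `("entity"` are left unchanged, so it should be safe to run over the whole folder.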
It is not working for me even after fixing every file. Every single chat* file starts with `{"result": "(\"entity\"<|>`, and yet I am unable to resume and get the following error:
```
❌ create_base_entity_graph
None
⠴ GraphRAG Indexer
├── Loading Input (text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
└── create_base_entity_graph
❌ Errors occurred during the pipeline run, see logs for more details.
```
with the same error:

`Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key`
Same here.
@KylinMountain @ShivamGupta42 can you have a look at the indexing-engine.log file?
```
05:01:28,42 graphrag.index.create_pipeline_config INFO skipping workflows
05:01:28,79 graphrag.index.run INFO Running pipeline
05:01:28,79 graphrag.index.storage.file_pipeline_storage INFO Creating file storage at private/rag/kg_micro/tin/how_to_help/output/20240709-043643/artifacts
05:01:28,79 graphrag.index.input.load_input INFO loading input from root_dir=input
05:01:28,79 graphrag.index.input.load_input INFO using file storage for input
05:01:28,80 graphrag.index.storage.file_pipeline_storage INFO search private/rag/kg_micro/tin/how_to_help/input for files matching .*.txt$
05:01:28,80 graphrag.index.input.text INFO found text files from input, found [('hth_copy.txt', {})]
05:01:28,82 graphrag.index.workflows.load INFO Workflow Run Order: ['create_base_text_units', 'create_base_extracted_entities', 'create_summarized_entities', 'create_base_entity_graph', 'create_final_entities', 'create_final_nodes', 'create_final_communities', 'join_text_units_to_entity_ids', 'create_final_relationships', 'join_text_units_to_relationship_ids', 'create_final_community_reports', 'create_final_text_units', 'create_base_documents', 'create_final_documents']
05:01:28,82 graphrag.index.run INFO Final # of rows loaded: 1
05:01:28,164 graphrag.index.run INFO Running workflow: create_base_text_units...
05:01:28,164 graphrag.index.run INFO Skipping create_base_text_units because it already exists
05:01:28,236 graphrag.index.run INFO Running workflow: create_base_extracted_entities...
05:01:28,236 graphrag.index.run INFO Skipping create_base_extracted_entities because it already exists
05:01:28,308 graphrag.index.run INFO Running workflow: create_summarized_entities...
05:01:28,308 graphrag.index.run INFO Skipping create_summarized_entities because it already exists
05:01:28,380 graphrag.index.run INFO Running workflow: create_base_entity_graph...
05:01:28,380 graphrag.index.run INFO dependencies for create_base_entity_graph: ['create_summarized_entities']
05:01:28,380 graphrag.index.run INFO read table from storage: create_summarized_entities.parquet
05:01:28,389 datashaper.workflow.workflow INFO executing verb cluster_graph
05:01:28,389 graphrag.index.verbs.graph.clustering.cluster_graph WARNING Graph has no nodes
05:01:28,390 datashaper.workflow.workflow ERROR Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key
Traceback (most recent call last):
  File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/datashaper/workflow/workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
  File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/graphrag/index/verbs/graph/clustering/cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
  File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/pandas/core/frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/pandas/core/frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/pandas/core/indexers/utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
05:01:28,393 graphrag.index.reporting.file_workflow_callbacks INFO Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key details=None
05:01:28,393 graphrag.index.run ERROR error running workflow create_base_entity_graph
Traceback (most recent call last):
  File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/graphrag/index/run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
  File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/datashaper/workflow/workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
  File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/datashaper/workflow/workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
  File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/graphrag/index/verbs/graph/clustering/cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
  File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/pandas/core/frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/pandas/core/frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/Users/shivamgupta/anaconda3/envs/charlie_private/lib/python3.10/site-packages/pandas/core/indexers/utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
05:01:28,393 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None
```
This line in the log:

`graphrag.index.verbs.graph.clustering.cluster_graph WARNING Graph has no nodes`

looks suspicious, but I don't understand why this would happen.
I imagine this could happen if, for one reason or another, a previous step was unable to extract entities. This could be due to errors in the source text, errors caused by the prompt used to generate the entities, or errors accessing the LLM. To know exactly what happened in your case, consider inspecting the different files that graphrag has created (look in the artifacts folder). You can use a tool to open Parquet files (I use Tad). Some of the Parquet files contain a single row (e.g. create_base_extracted_entities.parquet) in which you will find graphml content. You can copy this into a text editor and save it as a .graphml file, then use any tool you like to inspect the graph at that step (Gephi is very useful for visualizing graphml content).

You might also want to look at the cache folder, which includes the LLM responses for the different queries sent. Each graphrag workflow has its own subfolder there.
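Pulling the graphml out of a single-row Parquet file can also be scripted instead of copy-pasting from a viewer. This is a sketch using pandas; the `entity_graph` column name is an assumption about the workflow's output schema, so check the actual column names first:

```python
import pandas as pd

def extract_graphml(df: pd.DataFrame, column: str = "entity_graph") -> str:
    # The workflow output is a one-row table whose cell holds the
    # serialized graphml document as a plain string.
    return df[column].iloc[0]

def save_graphml(parquet_path: str, out_path: str,
                 column: str = "entity_graph") -> None:
    df = pd.read_parquet(parquet_path)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(extract_graphml(df, column))

# save_graphml("artifacts/create_base_extracted_entities.parquet", "step.graphml")
```

The resulting file can then be opened directly in Gephi.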
Same here.
@eyast create_base_extracted_entities.parquet and create_summarized_entities.parquet both contained empty structures:
```xml
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
    <graph edgedefault="undirected" />
</graphml>
```
create_base_text_units.parquet did contain the complete chunk_id / document_id mapping, etc. I am not sure what the next steps should be; I am looking through the graphrag repo now to understand.
Hey folks, after spending time going through the code, I realised that I was getting rate-limiting errors in the logs and had wrongly assumed that these are handled in the code. Because of those errors, the workflows before the clustering step were producing empty files, which led to the eventual error during clustering.

TL;DR: increase your rate limits on OpenAI. This worked for me. :D
Hi @ShivamGupta42, glad this worked!

My general rule of thumb when facing these issues is: for rate limiting, you can try adjusting the requests_per_minute and tokens_per_minute settings.
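In settings.yaml that would look something like the following (a sketch; the numbers are placeholders and should stay below the actual limits of your OpenAI tier):

```yaml
llm:
  # Keep both values below your account's real OpenAI limits so the
  # pipeline paces itself instead of hitting 429 responses.
  requests_per_minute: 500
  tokens_per_minute: 150000
  # More retries give the backoff a chance to recover from transient limits.
  max_retries: 10
```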
```
❌ create_final_community_reports
None
⠙ GraphRAG Indexer
├── Loading Input (InputFileType.text) - 1 files loaded (0 filtered) 100%
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
├── create_final_entities
├── create_final_nodes
├── create_final_communities
├── join_text_units_to_entity_ids
├── create_final_relationships
├── join_text_units_to_relationship_ids
└── create_final_community_reports
❌ Errors occurred during the pipeline run, see logs for more details.
```
```
11:17:11,575 graphrag.index.run ERROR error running workflow create_final_community_reports
Traceback (most recent call last):
  File "C:\Users\shrnema\AppData\Local\Programs\Python\Python311\Lib\site-packages\graphrag\index\run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
  File "C:\Users\shrnema\AppData\Local\Programs\Python\Python311\Lib\site-packages\datashaper\workflow\workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
  File "C:\Users\shrnema\AppData\Local\Programs\Python\Python311\Lib\site-packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
  File "C:\Users\shrnema\AppData\Local\Programs\Python\Python311\Lib\site-packages\datashaper\engine\verbs\window.py", line 73, in window
    window = __window_function_map[window_operation](...)
  File "C:\Users\shrnema\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\shrnema\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\indexes\range.py", line 417, in get_loc
    raise KeyError(key)
KeyError: 'community'
11:17:11,578 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None
```
@AlonsoGuevara Not sure why this issue was closed. I had several runs with no rate-limit errors and still received unparseable responses. Is the entity extraction failure being tracked somewhere we can follow?
> Hey folks, after spending time going through the code, I realised that I was getting rate-limiting errors in the logs and had wrongly assumed that these are handled in the code. Because of those errors, the workflows before the clustering step were producing empty files, which led to the eventual error during clustering.
>
> TL;DR: increase your rate limits on OpenAI. This worked for me. :D
Hi @ShivamGupta42, do you perhaps know what limits I should set if I use a gpt-35-turbo model from Azure OpenAI? I'm getting the error "EmptyNetwork" :(
The ValueError part of this issue is now probably solved by the better error handling since https://github.com/microsoft/graphrag/pull/532
> For those who are stuck after the create_base_text_units step with different error messages:
>
> - Open your project's cache/entity_extraction folder.
> - Take a look at any chat_XXXX file.
> - Your files should start with `{"result": "(\"entity\"` but might have formatting like `{"result": "**Entities**\n\n(\"entity\"`.
> - Remove the bogus prefix with a string replace on all cached files.
> - Re-run in the same project folder and it might succeed.
This is the most helpful comment I have seen so far.

I used GraphRAG with Ollama, and my cache files looked more like `{"result": "1. Entities:`, so I guessed that my LLM might not be following the prompts closely enough to match the required format. After changing my LLM to "mistral-nemo:latest" this error was gone 🎉 and the cache files now start with the expected format `{"result": "(\"entity\"<|>`.

However, another error occurred at create_final_entities, which I am currently investigating...
Invalid formatting of the responses in the entity extraction step is still breaking the process.

In another scenario, the entity delimiter was returned as `**##**` instead of the expected `##`. Again, a string replace across the entity_extraction folder allowed the process to resume.
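That kind of delimiter repair is easy to automate the same way. A sketch (the `chat_*` glob and the plain `##` record delimiter match what the default extraction prompt expects, but verify against your own cache files):

```python
from pathlib import Path

def normalize_delimiters(text: str) -> str:
    """Replace the bold-wrapped delimiter the LLM sometimes emits
    with the plain ## the extraction parser expects."""
    return text.replace("**##**", "##")

def fix_delimiters(folder: str) -> None:
    # Rewrite every chat_* cache file in place.
    for path in Path(folder).glob("chat_*"):
        path.write_text(normalize_delimiters(path.read_text()))

# fix_delimiters("cache/entity_extraction")
```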
I found that if the input file is named input.txt, it always throws this "Columns must be same length as key" error. Changing the filename to anything else resolved the issue. I'm not sure why, but I finally got unstuck after being blocked on this for 4 hours!
Model used: GPT-4o, OpenAI API, JSON mode. Default settings, default prompts, OpenAI rate limits configured according to tier.
The pipeline fails with the error message "ValueError: Columns must be same length as key". It seems the cluster_graph() function receives an empty create_base_extracted_entities.parquet file with zero nodes (indicated at 16:40:36,22 in logs below).
Taking a closer look, the entity extraction seems to fail due to invalid formatting. When a run fails, the cache folder still holds the entity_extraction responses, and on closer inspection the problem is in the prefix of the "result" object returned by the API.
Expected format:

```
{"result": "(\"entity\"<|>\" ....}
```

Failing formats sampled from different runs:

```
{"result": "**Entities:**\n\n(\"entity\"<|>\" ....}
{"result": "##(\"entity\"<|>\" ....}
```

After removing the unnecessary prefix in all cached files, the run can successfully index the document.
Constraining the response using logit bias or pydantic would help, and validating response samples would catch this earlier. The pipeline (and cluster_graph()) should also fail gracefully when no nodes have been extracted; as it stands, this is a frustrating error during an expensive run on a large corpus. Looking forward to digging deeper.
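Such sample validation could be as small as this (a hypothetical check, not part of graphrag; it only asserts that the `result` field starts with the entity tuple the default extraction prompt asks for):

```python
import json
import re

# The default extraction prompt asks for records opening with ("entity"<|>
EXPECTED_PREFIX = re.compile(r'^\("entity"<\|>')

def is_valid_response(raw: str) -> bool:
    """True iff a cached LLM response parses as JSON and its 'result'
    field starts with the expected entity-tuple format."""
    try:
        result = json.loads(raw)["result"]
        return bool(EXPECTED_PREFIX.match(result))
    except (ValueError, KeyError, TypeError):
        return False
```

Running this over the cache right after entity extraction would surface malformed responses before the expensive downstream workflows start.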
Logs:
Possibly related issues:
- https://github.com/microsoft/graphrag/issues/437
- https://github.com/microsoft/graphrag/issues/414
- https://github.com/microsoft/graphrag/issues/426
- https://github.com/microsoft/graphrag/issues/441