microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License
16.81k stars 1.57k forks

[Bug]: ValueError: Columns must be same length as key #514

Closed yuangtao closed 1 month ago

yuangtao commented 1 month ago

### Describe the bug

00:58:35,677 graphrag.index.verbs.graph.clustering.cluster_graph WARNING Graph has no nodes
00:58:35,679 datashaper.workflow.workflow ERROR Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key
Traceback (most recent call last):
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\graphrag\index\verbs\graph\clustering\cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\indexers\utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
00:58:35,682 graphrag.index.reporting.file_workflow_callbacks INFO Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key details=None
00:58:35,682 graphrag.index.run ERROR error running workflow create_base_entity_graph
Traceback (most recent call last):
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\graphrag\index\run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\datashaper\workflow\workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\graphrag\index\verbs\graph\clustering\cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
    ~~~~~~~~~^^^^^^^^^^^^^^^^
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "C:\Users\Ryan\AppData\Roaming\Python\Python311\site-packages\pandas\core\indexers\utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

### Steps to reproduce

Reproducing the demo with a locally deployed LLM produces the error.

### Expected Behavior

_No response_

### GraphRAG Config Used

_No response_

### Logs and screenshots

_No response_

### Additional Information

- GraphRAG Version:
- Operating System:
- Python Version:
- Related Issues:

AlonsoGuevara commented 1 month ago

Hi @yuangtao Could you please share your config file?

yuangtao commented 1 month ago

> Hi @yuangtao Could you please share your config file?

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: qwen2-0.5b
  model_supports_json: true # recommended if this is available for your model.
  max_tokens: 1024 #4000
  request_timeout: 180.0
  api_base: http://localhost:1234/v1

embeddings:
  # parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: nomic-embed-text-v1.5.Q2_K
    api_base: http://localhost:1234/v1

yuangtao commented 1 month ago

> Hi @yuangtao Could you please share your config file?

I used LM Studio for local deployment.

SeanFeng91 commented 1 month ago

Same issue here. Is this problem related to the model not being an OpenAI model?

SeanFeng91 commented 1 month ago

I may have found the reason. With the agicto API (api_base: https://api.agicto.cn/v1) and deepseek-chat plus text-embedding-3-small, it works. My "Columns must be same length as key, Errors occurred during the pipeline run" error may have been caused by a wrongly formatted api_base, which I had written as api_base:

gubinjie commented 1 month ago

The api_base path should end with /v1.
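For anyone who wants to check this before starting an indexing run, here is a minimal sketch. `ensure_v1_suffix` is a hypothetical helper of my own, not part of GraphRAG, and it only assumes that the OpenAI-compatible server exposes its API under a `/v1` path, as in the configs shared above.

```python
from urllib.parse import urlsplit, urlunsplit


def ensure_v1_suffix(api_base: str) -> str:
    """Return api_base with a trailing /v1 path segment (hypothetical helper)."""
    parts = urlsplit(api_base.strip())
    # A bare "host:port" has no usable scheme/netloc pair, so reject it early.
    if not parts.scheme or not parts.netloc:
        raise ValueError(
            f"api_base must be a full URL such as http://host:port/v1, got {api_base!r}"
        )
    path = parts.path.rstrip("/")
    if not path.endswith("/v1"):
        path += "/v1"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(ensure_v1_suffix("http://localhost:1234"))
```

The helper leaves a URL that already ends in `/v1` unchanged apart from stripping a trailing slash, so it should be safe to apply to every api_base value in the settings.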

etiennebonnafoux commented 1 month ago

I have dug into the issue a little. The problem occurs when the LLM generates an empty answer or when there is a problem parsing it.

Then, in the module cluster_graph.py, graphrag tries to execute (line 122):

output_df[[level_to, to]] = pd.DataFrame(
    output_df[to].tolist(), index=output_df.index
)

with typically

level_to = "level"
to = "clustered_graph"
output_df_index = RangeIndex(start=0, stop=1, step=1)

and the value shown in a screenshot (image not preserved). This doesn't work because that value does not have the right number of columns.
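The failure can be reproduced outside GraphRAG with plain pandas. This is only a sketch: the cell values are made up, and I am assuming that the degenerate case leaves a single NaN in the column instead of a (level, graph) pair.

```python
import pandas as pd

level_to, to = "level", "clustered_graph"

# Healthy case: every cell holds a (level, graph) pair, so expanding it
# yields exactly two columns, matching the two-column key on the left.
ok_df = pd.DataFrame({to: [(0, "<graphml/>")]})
ok_df[[level_to, to]] = pd.DataFrame(ok_df[to].tolist(), index=ok_df.index)

# Degenerate case (assumption): the cell holds NaN instead of a pair,
# so the expanded frame has one column while the key names two.
bad_df = pd.DataFrame({to: [float("nan")]})
expanded = pd.DataFrame(bad_df[to].tolist(), index=bad_df.index)
try:
    bad_df[[level_to, to]] = expanded
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(expanded.shape, error_message)
```

Any cell that does not unpack into exactly two values triggers the same error, which is why the message says nothing about the empty graph behind it.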

Now there are two choices:

etiennebonnafoux commented 1 month ago

In both cases, there should be a more explicit message in the log than this pandas error.
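As a rough idea of what a more explicit message could look like, here is a sketch of a guard around the failing assignment. `expand_pairs` is a hypothetical function, not GraphRAG code, and the wording of the error is only a suggestion.

```python
import pandas as pd


def expand_pairs(output_df: pd.DataFrame, level_to: str, to: str) -> pd.DataFrame:
    """Split a column of (level, graph) pairs into two columns, failing loudly."""
    expanded = pd.DataFrame(output_df[to].tolist(), index=output_df.index)
    # Check the shape ourselves so the error names the likely cause
    # instead of surfacing pandas' generic key-length message.
    if expanded.shape[1] != 2:
        raise ValueError(
            f"Expected (level, graph) pairs in column {to!r} but got "
            f"{expanded.shape[1]} value(s) per row; the entity graph is probably "
            "empty. Check that the LLM returned parseable extraction output."
        )
    result = output_df.copy()
    result[[level_to, to]] = expanded
    return result


good = pd.DataFrame({"clustered_graph": [(0, "<graphml/>")]})
print(expand_pairs(good, "level", "clustered_graph"))
```

Working on a copy keeps the caller's frame untouched when the check fails.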

natoverse commented 1 month ago

We see this issue filed commonly with models that return an unexpected format. Routing to the consolidated alternate model providers issue #657.

etiennebonnafoux commented 1 month ago

> We see this issue filed commonly with models that return an unexpected format. Routing to the consolidated alternate model providers issue #657.

But I do use Azure OpenAI. So it's not only an alternate model issue.