microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Bug]: < File "C:\ProgramData\anaconda3\envs\graphrag_env0716\Lib\site-packages\pandas\core\indexers\utils.py", line 390, in check_key_length raise ValueError("Columns must be same length as key")> #631

Closed myyourgit closed 3 months ago

myyourgit commented 3 months ago

### Describe the bug

The following error log is printed when running python -m graphrag.index --root ./ragtest0716:

FO dependencies for create_base_entity_graph: ['create_summarized_entities']
22:56:23,463 graphrag.index.run INFO read table from storage: create_summarized_entities.parquet
22:56:23,487 datashaper.workflow.workflow INFO executing verb cluster_graph
22:56:23,501 graphrag.index.verbs.graph.clustering.cluster_graph WARNING Graph has no nodes
22:56:23,514 datashaper.workflow.workflow ERROR Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key
Traceback (most recent call last):
  File "C:\ProgramData\anaconda3\envs\graphrag_env0716\Lib\site-packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\graphrag_env0716\Lib\site-packages\graphrag\index\verbs\graph\clustering\cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
    ~~~~~~~~~^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\graphrag_env0716\Lib\site-packages\pandas\core\frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "C:\ProgramData\anaconda3\envs\graphrag_env0716\Lib\site-packages\pandas\core\frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "C:\ProgramData\anaconda3\envs\graphrag_env0716\Lib\site-packages\pandas\core\indexers\utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
22:56:23,523 graphrag.index.reporting.file_workflow_callbacks INFO Error executing verb "cluster_graph" in create_base_entity_graph: Columns must be same length as key details=None
22:56:23,523 graphrag.index.run ERROR error running workflow create_base_entity_graph
Traceback (most recent call last):
  File "C:\ProgramData\anaconda3\envs\graphrag_env0716\Lib\site-packages\graphrag\index\run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\graphrag_env0716\Lib\site-packages\datashaper\workflow\workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\graphrag_env0716\Lib\site-packages\datashaper\workflow\workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\graphrag_env0716\Lib\site-packages\graphrag\index\verbs\graph\clustering\cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
    ~~~~~~~~~^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\graphrag_env0716\Lib\site-packages\pandas\core\frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "C:\ProgramData\anaconda3\envs\graphrag_env0716\Lib\site-packages\pandas\core\frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "C:\ProgramData\anaconda3\envs\graphrag_env0716\Lib\site-packages\pandas\core\indexers\utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
22:56:23,526 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None
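
The failing assignment can be reproduced outside GraphRAG. The sketch below is a simplified stand-in for line 102 of cluster_graph.py, not the library's actual code: the column names are hypothetical, and it assumes that a graph with no nodes leaves a NaN placeholder where a (level, graph) pair was expected. Under that assumption the expanded frame has one column while the assignment names two, which is the condition pandas rejects.

```python
import pandas as pd

# Hypothetical column names; the real verb receives them as parameters (level_to, to).
LEVEL_COL, GRAPH_COL = "level", "clustered_graph"


def split_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Expand a column of (level, graph) pairs into two columns, as the traceback line does."""
    out = df.copy()
    out[[LEVEL_COL, GRAPH_COL]] = pd.DataFrame(out[GRAPH_COL].tolist(), index=out.index)
    return out


# Normal case: every row holds a 2-tuple, so the expansion yields two columns.
ok = split_pairs(pd.DataFrame({GRAPH_COL: [(0, "<graphml/>"), (1, "<graphml/>")]}))

# Assumed empty-graph case: the row holds NaN instead of a pair, so the
# expansion yields a single column and the two-key assignment fails.
try:
    split_pairs(pd.DataFrame({GRAPH_COL: [float("nan")]}))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)  # Columns must be same length as key
```

If this is what happens here, the ValueError is a symptom; the "Graph has no nodes" warning just before it is the more useful clue.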

### Steps to reproduce

run python -m graphrag.index --root ./ragtest0716

### Expected Behavior

_No response_

### GraphRAG Config Used

Running LM Studio, with Gemma 2B as the LLM
and a Nomic AI model as the embedding model.

settings.yaml:

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: lm-studio
  type: openai_chat # or azure_openai_chat
  model: gemma-2b-it-GGUF/gemma-2b-it-q8_0.gguf
  model_supports_json: true # recommended if this is available for your model.
  api_base: http://localhost:1234/v1
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: lm-studio
    type: openai_embedding # or azure_openai_embedding
    model: nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf
    api_base: http://localhost:1234/v1

    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    max_retries: 100
    max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

### Logs and screenshots

{"type": "error", "data": "Error executing verb \"cluster_graph\" in create_base_entity_graph: Columns must be same length as key", "stack": "Traceback (most recent call last):\n  File \"C:\\ProgramData\\anaconda3\\envs\\graphrag_env0716\\Lib\\site-packages\\datashaper\\workflow\\workflow.py\", line 410, in _execute_verb\n    result = node.verb.func(**verb_args)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\ProgramData\\anaconda3\\envs\\graphrag_env0716\\Lib\\site-packages\\graphrag\\index\\verbs\\graph\\clustering\\cluster_graph.py\", line 102, in cluster_graph\n    output_df[[level_to, to]] = pd.DataFrame(\n    ~~~~~~~~~^^^^^^^^^^^^^^^^\n  File \"C:\\ProgramData\\anaconda3\\envs\\graphrag_env0716\\Lib\\site-packages\\pandas\\core\\frame.py\", line 4299, in __setitem__\n    self._setitem_array(key, value)\n  File \"C:\\ProgramData\\anaconda3\\envs\\graphrag_env0716\\Lib\\site-packages\\pandas\\core\\frame.py\", line 4341, in _setitem_array\n    check_key_length(self.columns, key, value)\n  File \"C:\\ProgramData\\anaconda3\\envs\\graphrag_env0716\\Lib\\site-packages\\pandas\\core\\indexers\\utils.py\", line 390, in check_key_length\n    raise ValueError(\"Columns must be same length as key\")\nValueError: Columns must be same length as key\n", "source": "Columns must be same length as key", "details": null}
{"type": "error", "data": "Error running pipeline!", "stack": "Traceback (most recent call last):\n  File \"C:\\ProgramData\\anaconda3\\envs\\graphrag_env0716\\Lib\\site-packages\\graphrag\\index\\run.py\", line 323, in run_pipeline\n    result = await workflow.run(context, callbacks)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\ProgramData\\anaconda3\\envs\\graphrag_env0716\\Lib\\site-packages\\datashaper\\workflow\\workflow.py\", line 369, in run\n    timing = await self._execute_verb(node, context, callbacks)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\ProgramData\\anaconda3\\envs\\graphrag_env0716\\Lib\\site-packages\\datashaper\\workflow\\workflow.py\", line 410, in _execute_verb\n    result = node.verb.func(**verb_args)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"C:\\ProgramData\\anaconda3\\envs\\graphrag_env0716\\Lib\\site-packages\\graphrag\\index\\verbs\\graph\\clustering\\cluster_graph.py\", line 102, in cluster_graph\n    output_df[[level_to, to]] = pd.DataFrame(\n    ~~~~~~~~~^^^^^^^^^^^^^^^^\n  File \"C:\\ProgramData\\anaconda3\\envs\\graphrag_env0716\\Lib\\site-packages\\pandas\\core\\frame.py\", line 4299, in __setitem__\n    self._setitem_array(key, value)\n  File \"C:\\ProgramData\\anaconda3\\envs\\graphrag_env0716\\Lib\\site-packages\\pandas\\core\\frame.py\", line 4341, in _setitem_array\n    check_key_length(self.columns, key, value)\n  File \"C:\\ProgramData\\anaconda3\\envs\\graphrag_env0716\\Lib\\site-packages\\pandas\\core\\indexers\\utils.py\", line 390, in check_key_length\n    raise ValueError(\"Columns must be same length as key\")\nValueError: Columns must be same length as key\n", "source": "Columns must be same length as key", "details": null}

### Additional Information

- GraphRAG Version: latest
- Operating System: win10
- Python Version: 3.11.9
- Related Issues: 

AlonsoGuevara commented 3 months ago

Hi!

This is generally caused by faulty entity extraction. I would recommend taking a look at the generated cache files for this step; it could be that either the LLM returned a malformed response or that it is being chatty when answering.

We are centralizing other LLM discussions in these threads: Other LLM/API bases: #339, Ollama: #345, Local embeddings: #370.

I'll resolve this issue so we can keep the focus on those threads.
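
One way to act on the suggestion of checking the cache files is a small script that flags responses that look malformed or chatty. This is only a sketch under assumptions not confirmed in this thread: that each cached LLM call is a JSON file with the raw model text under a "result" key, and that a well-formed extraction separates tuple fields with "<|>". The function name and the directory path in the usage lines are hypothetical.

```python
import json
from pathlib import Path

# Assumption: a well-formed extraction response separates tuple fields with "<|>".
TUPLE_DELIMITER = "<|>"


def find_suspect_responses(cache_dir: str) -> list[tuple[str, str]]:
    """Return (file name, reason) for cached responses that look malformed."""
    suspects = []
    for path in sorted(Path(cache_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            payload = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            suspects.append((path.name, "not valid JSON"))
            continue
        # Assumption: the raw model output is stored under a "result" key.
        text = payload.get("result") if isinstance(payload, dict) else None
        if not isinstance(text, str) or not text.strip():
            suspects.append((path.name, "empty or missing result"))
        elif TUPLE_DELIMITER not in text:
            suspects.append((path.name, "no tuple delimiter; model may be chatty"))
    return suspects


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever the indexing run wrote its cache.
    cache = Path("./ragtest0716/cache/entity_extraction")
    if cache.is_dir():
        for name, reason in find_suspect_responses(str(cache)):
            print(f"{name}: {reason}")
```

A long list of flagged files would suggest the model is not following the extraction prompt's output format, which fits the "Graph has no nodes" warning in the log.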

myyourgit commented 3 months ago

> Hi!
>
> This is generally caused by faulty entity extraction. I would recommend taking a look at the generated cache files for this step; it could be that either the LLM returned a malformed response or that it is being chatty when answering.
>
> We are centralizing other LLM discussions in these threads: Other LLM/API bases: #339, Ollama: #345, Local embeddings: #370.
>
> I'll resolve this issue so we can keep the focus on those threads.

Hi Alonso, thank you! The three threads above seem to address the local Ollama setup issue, not my issue. Thanks.