microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

Clustering crashes: ValueError("Columns must be same length as key") - too little input text maybe? #362

Closed simoncelinder closed 2 days ago

simoncelinder commented 3 days ago

Hi!

I was able to reproduce the example at: https://microsoft.github.io/graphrag/posts/get_started/

However, when I use the exact same method with some shorter fictional stories, it crashes during the clustering step.

Text input: I paste the following text into a .txt file: https://gist.github.com/simoncelinder/0fbb9aaebed1e21801ab6c6e11a0dda5

Error: When I then run python -m graphrag.index --root ./ragtest, I get the error below (my added printouts suggest that the cluster_graph function receives an empty list as input): [image]

Maybe less relevant, since it is downstream of this problem: inspecting the log files suggests a shape mismatch:

21:21:04,371 graphrag.index.run ERROR error running workflow create_base_entity_graph
Traceback (most recent call last):
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/graphrag/index/run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/datashaper/workflow/workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/datashaper/workflow/workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/graphrag/index/verbs/graph/clustering/cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/pandas/core/frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/pandas/core/frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/pandas/core/indexers/utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
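
For readers hitting the same message: the traceback is consistent with pandas being asked to assign a DataFrame with zero columns to a two-column key, which is what would happen if the clustering step produced no rows. The following is a minimal sketch of that failure mode, not the actual graphrag code (the call in the traceback is truncated at `pd.DataFrame(`). The helper names `split_pairs` and `try_split` and the column names `clustered`, `level`, and `graph` are hypothetical.

```python
import pandas as pd


def split_pairs(df, source, targets):
    """Expand a column of (level, graph) pairs into two new columns.

    Hypothetical helper: it mimics the two-column assignment pattern
    visible in the traceback, not the real cluster_graph body.
    """
    out = df.copy()
    # With zero rows, tolist() is [] and the resulting frame has zero columns.
    expanded = pd.DataFrame(out[source].tolist(), index=out.index)
    # pandas checks that the frame has as many columns as the key has names.
    out[targets] = expanded
    return out


def try_split(rows):
    """Return the expanded frame, or the error message if pandas refuses."""
    df = pd.DataFrame({"clustered": rows})
    try:
        return split_pairs(df, "clustered", ["level", "graph"])
    except ValueError as exc:
        return str(exc)


if __name__ == "__main__":
    # Non-empty input: each pair is split across the two target columns.
    print(try_split([(0, "g0"), (1, "g1")]))
    # Empty input: pandas raises the same message as in the log above.
    print(try_split([]))
```

If this is indeed the cause, the shape mismatch is only a symptom, and the thing to check is why the step upstream of clustering produced no entities.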

So to reproduce this problem, one would just put the text example in the input file (input/book.txt) and execute exactly as in the guide. I tried tweaking some params in settings.yaml, since I assumed the problem was the shorter input text (various lengths, chunk sizes, max number of clusters, etc.), but without any luck so far.

My versions:

Any ideas? :-)

Thanks in advance and this seems like a really nice tool!

eyast commented 2 days ago

I was not able to reproduce the error you've faced. The process completed successfully on my end, and I can see the communities generated, with their summaries. In my setup I use GPT-4o; otherwise it's based on the standard library. If you explore the folder structure in outputs, you can find artifacts, as well as interesting logs in outputs\{timestamp}\reports\. For example, make sure that you are not hitting rate limits that prevent you from proceeding further down the pipeline. You can also find the artifacts generated by each step as parquet files in the artifacts folder. I use tad to explore the contents of the files.

PS: Writing part of the story in 1st person is ingenious if you ask me. I wonder if you need to modify your prompt or entity configuration to make sure the LLM retrieves the narrator as an entity.

simoncelinder commented 2 days ago

OK, will try again with GPT-4o!

(The main idea is to test the capability to combine stories told from different perspectives, including stories “about” the central person, hence not always first-person perspective. Thanks for the input about checking the prompt though 👍🏻.)

simoncelinder commented 2 days ago

Seems to work now. Maybe it was just my project being in some weird state, or the env variables having comments, other variables in .env, or names that were not exactly right. Works with all the defaults, incl. the default LLM. Thanks for the help! 💪🏻
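
Since the suspected causes above were all in the .env file, a quick check along these lines may help when debugging a similar setup. This is only a sketch: the function name `lint_env` is hypothetical, and the required variable name `GRAPHRAG_API_KEY` is an assumption not stated in this thread, so adjust it to whatever your settings.yaml actually references.

```python
def lint_env(text, required=("GRAPHRAG_API_KEY",)):
    """Report likely problems in the contents of a .env file.

    Assumption: `required` lists the variable names your configuration
    expects; the default here is illustrative only.
    """
    problems = []
    seen = set()
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Blank lines and full-line comments are fine.
        if not line or line.startswith("#"):
            continue
        if "=" not in line:
            problems.append(f"line {number}: no '=' found")
            continue
        name, value = line.split("=", 1)
        name = name.strip()
        seen.add(name)
        # A trailing comment may end up inside the value with some loaders.
        if " #" in value:
            problems.append(
                f"line {number}: inline comment after value of {name}"
            )
    for name in required:
        if name not in seen:
            problems.append(f"missing variable {name}")
    return problems


if __name__ == "__main__":
    sample = "GRAPHRAG_API_KEY=abc # my key\n"
    for problem in lint_env(sample):
        print(problem)
```

An empty result only means none of these particular patterns were found. It does not confirm that the key itself is valid.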