microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Bug]: csv filetype throws error while indexing #588

Closed tonypius closed 1 week ago

tonypius commented 1 month ago

Describe the bug

When loading CSV files from the input folder, the indexing step fails with the error: Error executing verb "zip" in create_base_text_units: 'text'

Steps to reproduce

I have GraphRAG set up with Azure OpenAI and successfully ran it on a .txt file. But when I tried to load 11 CSV files, the files loaded properly (see the logs below) and the run failed as soon as the pipeline started.

Logs and screenshots

11:32:54,901 graphrag.index.input.csv INFO loading 11 csv files
11:32:54,903 graphrag.index.input.csv INFO Total number of unfiltered csv rows: 13469
11:32:54,905 graphrag.index.workflows.load INFO Workflow Run Order: ['create_base_text_units', 'create_base_extracted_entities', 'create_summarized_entities', 'create_base_entity_graph', 'create_final_entities', 'create_final_nodes', 'create_final_communities', 'join_text_units_to_entity_ids', 'create_final_relationships', 'join_text_units_to_relationship_ids', 'create_final_community_reports', 'create_final_text_units', 'create_base_documents', 'create_final_documents']
11:32:54,905 graphrag.index.run INFO Final # of rows loaded: 13469
11:32:55,105 graphrag.index.run INFO Running workflow: create_base_text_units...
11:32:55,105 graphrag.index.run INFO dependencies for create_base_text_units: []
11:32:55,111 datashaper.workflow.workflow INFO executing verb orderby
11:32:55,128 datashaper.workflow.workflow INFO executing verb zip
11:32:55,128 datashaper.workflow.workflow ERROR Error executing verb "zip" in create_base_text_units: 'text'

automateyournetwork commented 1 month ago

I can't even get a single CSV to work

08:55:25,872 datashaper.workflow.workflow ERROR Error executing verb "zip" in create_base_text_units: 'text'
Traceback (most recent call last):
  File "/home/fragb0x/GRAPH/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'text'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/fragb0x/GRAPH/lib/python3.10/site-packages/datashaper/workflow/workflow.py", line 410, in _execute_verb
    result = node.verb.func(*verb_args)
  File "/home/fragb0x/GRAPH/lib/python3.10/site-packages/graphrag/index/verbs/zip.py", line 29, in zip_verb
    table[to] = list(zip(*[table[col] for col in columns], strict=True))
  File "/home/fragb0x/GRAPH/lib/python3.10/site-packages/graphrag/index/verbs/zip.py", line 29, in <listcomp>
    table[to] = list(zip(*[table[col] for col in columns], strict=True))
  File "/home/fragb0x/GRAPH/lib/python3.10/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/fragb0x/GRAPH/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'text'
08:55:25,876 graphrag.index.reporting.file_workflow_callbacks INFO Error executing verb "zip" in create_base_text_units: 'text' details=None
08:55:25,876 graphrag.index.run ERROR error running workflow create_base_text_units
Traceback (most recent call last):
  File "/home/fragb0x/GRAPH/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'text'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/fragb0x/GRAPH/lib/python3.10/site-packages/graphrag/index/run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
  File "/home/fragb0x/GRAPH/lib/python3.10/site-packages/datashaper/workflow/workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
  File "/home/fragb0x/GRAPH/lib/python3.10/site-packages/datashaper/workflow/workflow.py", line 410, in _execute_verb
    result = node.verb.func(*verb_args)
  File "/home/fragb0x/GRAPH/lib/python3.10/site-packages/graphrag/index/verbs/zip.py", line 29, in zip_verb
    table[to] = list(zip(*[table[col] for col in columns], strict=True))
  File "/home/fragb0x/GRAPH/lib/python3.10/site-packages/graphrag/index/verbs/zip.py", line 29, in <listcomp>
    table[to] = list(zip(*[table[col] for col in columns], strict=True))
  File "/home/fragb0x/GRAPH/lib/python3.10/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/fragb0x/GRAPH/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'text'
08:55:25,877 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None
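
For readers following the traceback, the failure can be reproduced outside GraphRAG with a few lines of pandas. This is a minimal sketch, not the library's code: `zip_columns` is a hypothetical stand-in modeled on the line quoted in the traceback (zip.py, line 29), and the column names `id`, `body`, and `chunk` are made up for illustration. It assumes Python 3.10 or later, since `zip(..., strict=True)` is used.

```python
import pandas as pd


def zip_columns(table: pd.DataFrame, to: str, columns: list[str]) -> pd.DataFrame:
    """Hypothetical stand-in for the failing line quoted in the traceback."""
    table = table.copy()
    # table[col] raises KeyError when the requested column does not exist.
    table[to] = list(zip(*[table[col] for col in columns], strict=True))
    return table


# A CSV-like table whose text lives in "body" rather than "text".
df = pd.DataFrame({"id": ["1", "2"], "body": ["first row", "second row"]})

try:
    zip_columns(df, to="chunk", columns=["id", "text"])
    error = None
except KeyError as exc:
    error = exc

print(repr(error))
```

The lookup of the missing `text` column is what surfaces as `KeyError: 'text'`, which suggests the pipeline expects that column to be present in the loaded rows.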

jasonmaoverlord commented 1 month ago

I have solved the problem. Even with settings.yaml configured correctly for CSV input, your CSV file must have a column named "text".
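
The fix described above can be sketched with pandas: check the CSV for a `text` column and, if it is missing, rename the column that holds the document body before indexing. This is an illustrative sketch only; `ensure_text_column` and the source column name `body` are assumptions for the example, not part of GraphRAG.

```python
import io

import pandas as pd


def ensure_text_column(df: pd.DataFrame, source_column: str) -> pd.DataFrame:
    """Return a frame that has a "text" column (hypothetical helper).

    If "text" already exists the frame is returned unchanged; otherwise
    `source_column` is renamed to "text".
    """
    if "text" in df.columns:
        return df
    if source_column not in df.columns:
        raise KeyError(
            f"neither 'text' nor {source_column!r} found in {list(df.columns)}"
        )
    return df.rename(columns={source_column: "text"})


# In-memory CSV standing in for a file in the input folder.
raw = io.StringIO("id,body\n1,first row\n2,second row\n")
fixed = ensure_text_column(pd.read_csv(raw), source_column="body")
print(list(fixed.columns))
```

The corrected frame can then be written back with `fixed.to_csv(path, index=False)`, where `path` is your own input file.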

Jaisaxena16 commented 1 month ago

    self._reader = parsers.TextReader(src, **kwds)
  File "parsers.pyx", line 574, in pandas._libs.parsers.TextReader.__cinit__
  File "parsers.pyx", line 663, in pandas._libs.parsers.TextReader._get_header
  File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2053, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 3355: invalid start byte
Sentry is attempting to send 2 pending events
Waiting up to 2 seconds

Please help: CSV files as input are not working. I changed settings.yaml, but it is still throwing an error.
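
The `UnicodeDecodeError` in this comment is a separate problem from the missing "text" column: the file is not valid UTF-8. Byte 0x92 is the right single quotation mark (’) in Windows-1252, so the CSV was plausibly exported with that encoding, though that is an assumption about this particular file. A minimal sketch of re-encoding such a file to UTF-8 follows; `reencode_to_utf8` is a hypothetical helper and the file names are placeholders.

```python
import tempfile
from pathlib import Path


def reencode_to_utf8(src: Path, dst: Path, source_encoding: str = "cp1252") -> None:
    """Rewrite `src` as UTF-8, assuming it was saved as `source_encoding`."""
    text = src.read_text(encoding=source_encoding)
    dst.write_text(text, encoding="utf-8")


# Demonstration with a temporary file containing the offending byte 0x92.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input_cp1252.csv"
    dst = Path(tmp) / "input_utf8.csv"
    src.write_bytes(b"text\nit\x92s a test\n")

    # Reading the original as UTF-8 fails the same way as in the traceback.
    try:
        src.read_text(encoding="utf-8")
        decode_failed = False
    except UnicodeDecodeError:
        decode_failed = True

    reencode_to_utf8(src, dst)
    converted = dst.read_text(encoding="utf-8")

print(decode_failed, repr(converted))
```

If the guess about the encoding is wrong, the converted text will contain odd characters, so it is worth inspecting the output before indexing. When loading the file yourself with pandas, `pd.read_csv(path, encoding="cp1252")` is an alternative to rewriting it.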

github-actions[bot] commented 3 weeks ago

This issue has been marked stale due to inactivity after repo maintainer or community member responses that request more information or suggest a solution. It will be closed after five additional days.

github-actions[bot] commented 1 week ago

This issue has been closed after being marked as stale for five days. Please reopen if needed.