Open yangxue-1 opened 1 month ago
I've been experiencing this bug for about a week now, and this error appears in the indexing-engine.log:
Traceback (most recent call last):
  File "/home/notebook/code/group/rag_reearch/graphrag-0.3.0/graphrag/index/emit/parquet_table_emitter.py", line 40, in emit
    await self._storage.set(filename, data.to_parquet())
  File "/opt/conda/envs/graphrag/lib/python3.10/site-packages/pandas/util/_decorators.py", line 333, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/graphrag/lib/python3.10/site-packages/pandas/core/frame.py", line 3113, in to_parquet
    return to_parquet(
  File "/opt/conda/envs/graphrag/lib/python3.10/site-packages/pandas/io/parquet.py", line 480, in to_parquet
    impl.write(
  File "/opt/conda/envs/graphrag/lib/python3.10/site-packages/pandas/io/parquet.py", line 190, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 3874, in pyarrow.lib.Table.from_pandas
  File "/opt/conda/envs/graphrag/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 624, in dataframe_to_arrays
    arrays[i] = maybe_fut.result()
  File "/opt/conda/envs/graphrag/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/opt/conda/envs/graphrag/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/opt/conda/envs/graphrag/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/conda/envs/graphrag/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 598, in convert_column
    raise e
  File "/opt/conda/envs/graphrag/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 592, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 339, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('cannot mix list and non-list, non-null values', 'Conversion failed for column findings with type object')
I've also re-executed on version 0.3.0 and still get errors
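The ArrowInvalid message says that the `findings` column holds more than one kind of Python object (lists in some rows, something else in others), while Arrow needs a single type per column. A minimal, stdlib-only sketch for checking that; `count_value_types` is a hypothetical helper, not part of GraphRAG, and the sample data is made up to stand in for `df["findings"]`:

```python
from collections import Counter


def count_value_types(values):
    """Count how many cells of each Python type an iterable contains."""
    return Counter(type(v).__name__ for v in values)


# Illustrative stand-in for df["findings"]; the real column may look different.
sample_findings = [
    [{"summary": "a", "explanation": "b"}],  # list of dicts
    "free text instead of a list",           # stray string
    None,                                    # missing value
]

type_counts = count_value_types(sample_findings)
print(type_counts)
```

More than one non-null type in the result would be consistent with the "cannot mix list and non-list" error; the helper accepts any iterable, so a pandas Series can be passed directly.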
I found out that my problem might be a slight formatting error in the template.
Thanks. I found that my report prompt and the official report prompt also have some formatting differences; I'm trying that out.
Same problem, how to resolve?
21:56:42,374 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_final_community_reports.parquet
21:56:42,376 graphrag.index.emit.parquet_table_emitter ERROR Error while emitting parquet table
Traceback (most recent call last):
  File "/home/lile/microsoft/graphrag/graphrag/index/emit/parquet_table_emitter.py", line 40, in emit
    await self._storage.set(filename, data.to_parquet())
                                      ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/util/_decorators.py", line 333, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/core/frame.py", line 3113, in to_parquet
    return to_parquet(
           ^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/io/parquet.py", line 480, in to_parquet
    impl.write(
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/io/parquet.py", line 190, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 3874, in pyarrow.lib.Table.from_pandas
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 611, in dataframe_to_arrays
    arrays = [convert_column(c, f)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 611, in <listcomp>
    arrays = [convert_column(c, f)
              ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 598, in convert_column
    raise e
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 592, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 339, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('cannot mix struct and non-struct, non-null values', 'Conversion failed for column findings with type object')
21:56:42,381 graphrag.index.reporting.file_workflow_callbacks INFO Error emitting table details=None
21:56:42,677 graphrag.index.run INFO Running workflow: create_final_text_units...
21:56:42,677 graphrag.index.run INFO dependencies for create_final_text_units: ['join_text_units_to_entity_ids', 'create_base_text_units', 'join_text_units_to_relationship_ids']
21:56:42,677 graphrag.index.run INFO read table from storage: join_text_units_to_entity_ids.parquet
I tried modifying the prompt, but the error is still reported and I don't know what causes it. The bug does not occur on a small dataset; it only fails when I switch to a large dataset.
File "/home/lile/microsoft/graphrag/graphrag/index/emit/parquet_table_emitter.py", line 40, in emit
    await self._storage.set(filename, data.to_parquet())
I'm planning to dump the data to a CSV at this line of code to see what the problem is.
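One way to do that inspection without writing the whole table is to dump only the cells that break the expected shape. The sketch below is stdlib-only and assumes that a well-formed `findings` cell is a Python list or missing; `dump_non_list_cells` and the sample data are hypothetical, and an in-memory buffer stands in for a real CSV file:

```python
import csv
import io


def dump_non_list_cells(values, out):
    """Write (row position, type name, repr) for every cell that is not a list."""
    writer = csv.writer(out)
    writer.writerow(["row", "type", "value"])
    bad_rows = 0
    for position, value in enumerate(values):
        if value is None or isinstance(value, list):
            continue  # assumed to be the expected shape
        writer.writerow([position, type(value).__name__, repr(value)[:200]])
        bad_rows += 1
    return bad_rows


# Illustrative stand-in for data["findings"]
cells = [[{"summary": "ok"}], {"summary": "dict, not list"}, None, "text"]
buffer = io.StringIO()
bad_count = dump_non_list_cells(cells, buffer)
print(buffer.getvalue())
```

Passing an open file handle instead of the buffer would produce a small CSV listing just the suspicious rows.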
I added error handling with a fallback: if the first parquet write fails, the data is preprocessed and the write is retried. The code is below; with this change GraphRAG runs fine.
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""ParquetTableEmitter module."""

import logging
import traceback

import numpy as np
import pandas as pd
from pyarrow.lib import ArrowInvalid, ArrowTypeError

from graphrag.index.storage import PipelineStorage
from graphrag.index.typing import ErrorHandlerFn

from .table_emitter import TableEmitter

log = logging.getLogger(__name__)


class ParquetTableEmitter(TableEmitter):
    """ParquetTableEmitter class."""

    _storage: PipelineStorage
    _on_error: ErrorHandlerFn

    def __init__(
        self,
        storage: PipelineStorage,
        on_error: ErrorHandlerFn,
    ):
        """Create a new Parquet Table Emitter."""
        self._storage = storage
        self._on_error = on_error

    async def preprocess_and_emit(self, filename: str, data: pd.DataFrame) -> None:
        """Preprocess data and emit to storage."""

        def preprocess_findings_column(df):
            def ensure_struct(x):
                # Give every cell the same (struct) shape so Arrow can infer one type.
                if isinstance(x, dict):
                    return x
                # np.any handles both scalar and list-like cells
                # (pd.isnull returns a plain bool for scalars).
                elif np.any(pd.isnull(x)):
                    return None
                else:
                    return {'value': x}

            df['findings'] = df['findings'].apply(ensure_struct)
            return df

        # Apply preprocessing
        data = preprocess_findings_column(data)
        await self._storage.set(filename, data.to_parquet())

    async def emit(self, name: str, data: pd.DataFrame) -> None:
        """Emit a dataframe to storage."""
        filename = f"{name}.parquet"
        log.info("Emitting parquet table %s", filename)
        try:
            await self._storage.set(filename, data.to_parquet())
        except (ArrowTypeError, ArrowInvalid) as e:
            # Arrow could not infer a single column type: normalize and retry once.
            log.warning("Initial parquet save failed, preprocessing data and retrying due to error: %s", str(e))
            try:
                await self.preprocess_and_emit(filename, data)
            except Exception as ex:
                log.exception("Error while emitting parquet table after retry")
                self._on_error(
                    ex,
                    traceback.format_exc(),
                    None,
                )
        except Exception as e:
            log.exception("Unexpected error while emitting parquet table")
            self._on_error(
                e,
                traceback.format_exc(),
                None,
            )
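The retry path above wraps every non-dict cell in `{'value': x}`. An alternative would be to coerce every cell to one consistent shape before writing. The sketch below assumes that `findings` is meant to be a list of dicts with `summary` and `explanation` keys (an assumption based on the community report format, not confirmed in this thread); `normalize_findings` is a hypothetical helper, not part of GraphRAG:

```python
import json
import math


def normalize_findings(value):
    """Coerce one `findings` cell to a list of dicts, or None if missing."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    if isinstance(value, dict):
        return [value]
    if isinstance(value, str):
        # The cell may hold the list serialized as a JSON string.
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return [{"summary": value, "explanation": ""}]
        return normalize_findings(parsed)
    if isinstance(value, (list, tuple)):
        items = []
        for item in value:
            if isinstance(item, dict):
                items.append(item)
            else:
                # Wrap stray scalars so every element is a dict.
                items.append({"summary": str(item), "explanation": ""})
        return items
    return [{"summary": str(value), "explanation": ""}]


print(normalize_findings("plain text"))
print(normalize_findings(["x", {"summary": "y"}]))
```

It could be applied with `data["findings"] = data["findings"].apply(normalize_findings)` before calling `to_parquet()`. Dicts with inconsistent keys or value types may still need further cleanup.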
Thanks for the reply. I had also solved the problem earlier by converting the data to CSV first; I'll try your method.
Thanks everyone for your insights. Converting the pandas data frame to csv, then converting to parquet worked for me. However, I am getting a new issue:
You are trying to merge on int64 and object columns for key 'community'. If you wish to proceed you should use pd.concat
Update: solved the issue by casting column community to string in pandas
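A small sketch of that dtype mismatch and the cast, using made-up frames (the column values and the `title`/`rank` columns are illustrative only). pandas refuses to merge an integer key with a string key, and casting both sides to the same dtype lets the merge proceed:

```python
import pandas as pd

# Illustrative frames: one key column is int64, the other holds strings.
left = pd.DataFrame({"community": [0, 1, 2], "title": ["a", "b", "c"]})
right = pd.DataFrame({"community": ["0", "1", "2"], "rank": [3.0, 2.0, 1.0]})

try:
    left.merge(right, on="community")
    merge_failed = False
except ValueError as err:
    merge_failed = True
    print(err)

# Cast both keys to the same dtype before merging.
left["community"] = left["community"].astype(str)
right["community"] = right["community"].astype(str)
merged = left.merge(right, on="community")
print(merged)
```

Casting to string makes the merge work, but code that later expects integer community ids will need to convert them back.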
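Thanks @xgl0626 and @therealcyberlord. I ran into the same problem and changed the code as follows; everything runs fine now.
try:
    # Round-trip the dataframe through a temporary CSV file, then write parquet
    data.to_csv('./buf.csv', encoding='UTF-8')
    data = pd.read_csv('./buf.csv', encoding='UTF-8')
    data['community'] = data['community'].astype(str)
    await self._storage.set(filename, data.to_parquet())
    # buf.csv is a file, so use os.remove (needs `import os`); shutil.rmtree only removes directories
    os.remove('./buf.csv')
except ArrowTypeError as e: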
Thanks for the help. The create_final_community_reports.parquet file has now been created, but local search fails with:
  File "/home/lile/graphrag/graphrag/query/api.py", line 272, in local_search_streaming
    _entities = read_indexer_entities(nodes, entities, community_level)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lile/graphrag/graphrag/query/indexer_adapters.py", line 105, in read_indexer_entities
    entity_df["community"] = entity_df["community"].astype(int)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/core/generic.py", line 6643, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 430, in astype
    return self.apply(
           ^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 363, in apply
    applied = getattr(b, f)(**kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/core/internals/blocks.py", line 758, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/core/dtypes/astype.py", line 237, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/core/dtypes/astype.py", line 182, in astype_array
    values = _astype_nansafe(values, dtype, copy=copy)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/core/dtypes/astype.py", line 133, in _astype_nansafe
    return arr.astype(dtype, copy=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '4.0'
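The `'4.0'` in this error would be consistent with the CSV round trip. Assuming the `community` column contained missing values when it was read back, pandas would load it as float, and `astype(str)` would then produce strings such as `'4.0'`, which `int()` rejects. A stdlib-only sketch of a more tolerant conversion; `to_int_community` and its `missing=-1` default are hypothetical choices, not GraphRAG behavior:

```python
def to_int_community(value, missing=-1):
    """Convert values like 4, '4', or '4.0' to int; map blanks and NaN to `missing`."""
    text = str(value).strip()
    if text == "" or text.lower() in {"nan", "none"}:
        return missing
    number = float(text)
    if not number.is_integer():
        raise ValueError(f"not an integer community id: {value!r}")
    return int(number)


for raw in ["4.0", "4", 4, float("nan"), ""]:
    print(repr(raw), to_int_community(raw))
```

Applied to the stored column, for example `entity_df["community"].apply(to_int_community)`, this would give the integer ids that the query code expects. Keeping the column numeric when writing the parquet file would avoid the problem at the source.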
Describe the issue
The create_final_community_reports.parquet file is not generated after all processes of the index are executed.