microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

The create_final_community_reports.parquet file disappears. #940

Open yangxue-1 opened 1 month ago

yangxue-1 commented 1 month ago

Describe the issue

The create_final_community_reports.parquet file is not generated, even after all indexing steps have run.

xgl0626 commented 1 month ago

I've been experiencing this bug for about a week now, and this error appears in the indexing-engine.log:

Traceback (most recent call last):
  File "/home/notebook/code/group/rag_reearch/graphrag-0.3.0/graphrag/index/emit/parquet_table_emitter.py", line 40, in emit
    await self._storage.set(filename, data.to_parquet())
  File "/opt/conda/envs/graphrag/lib/python3.10/site-packages/pandas/util/_decorators.py", line 333, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/graphrag/lib/python3.10/site-packages/pandas/core/frame.py", line 3113, in to_parquet
    return to_parquet(
  File "/opt/conda/envs/graphrag/lib/python3.10/site-packages/pandas/io/parquet.py", line 480, in to_parquet
    impl.write(
  File "/opt/conda/envs/graphrag/lib/python3.10/site-packages/pandas/io/parquet.py", line 190, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 3874, in pyarrow.lib.Table.from_pandas
  File "/opt/conda/envs/graphrag/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 624, in dataframe_to_arrays
    arrays[i] = maybe_fut.result()
  File "/opt/conda/envs/graphrag/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/opt/conda/envs/graphrag/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/opt/conda/envs/graphrag/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/conda/envs/graphrag/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 598, in convert_column
    raise e
  File "/opt/conda/envs/graphrag/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 592, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 339, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('cannot mix list and non-list, non-null values', 'Conversion failed for column findings with type object')

I also re-ran indexing on version 0.3.0 and still get the error.

yangxue-1 commented 1 month ago

I found that my problem may be caused by a slight formatting error in the prompt template.

xgl0626 commented 1 month ago

Thanks. I found that my report prompt also differs in format from the official report prompt, so I'm trying that fix now.
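One way to catch this kind of prompt drift before it reaches the Parquet writer is to normalize the findings of each parsed report. The sketch below is illustrative only: it assumes (the thread does not confirm this) that a report is a JSON object whose `findings` field should be a list of objects with `summary` and `explanation` keys. `normalize_findings` is a hypothetical helper, not part of GraphRAG.

```python
import json


def normalize_findings(raw_report: str) -> list[dict]:
    """Return findings as a list of {"summary", "explanation"} dicts.

    Hypothetical helper: the field names are an assumption about the
    report format, not something confirmed in this thread.
    """
    report = json.loads(raw_report)
    findings = report.get("findings", [])

    # A bare string or a single object is wrapped so every row is a list.
    if isinstance(findings, (str, dict)):
        findings = [findings]
    elif not isinstance(findings, list):
        findings = []

    normalized = []
    for item in findings:
        if isinstance(item, dict):
            normalized.append({
                "summary": str(item.get("summary", "")),
                "explanation": str(item.get("explanation", "")),
            })
        else:
            # Plain strings become a finding with an empty explanation.
            normalized.append({"summary": str(item), "explanation": ""})
    return normalized


good = '{"findings": [{"summary": "A", "explanation": "B"}]}'
drifted = '{"findings": "just one sentence"}'
print(normalize_findings(good))
print(normalize_findings(drifted))
```

If every row has the same shape, Arrow no longer has to reconcile list and non-list values in one column, which is the condition the error message complains about.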

xxll88 commented 1 month ago

Same problem. How can I resolve it?

21:56:42,374 graphrag.index.emit.parquet_table_emitter INFO emitting parquet table create_final_community_reports.parquet
21:56:42,376 graphrag.index.emit.parquet_table_emitter ERROR Error while emitting parquet table
Traceback (most recent call last):
  File "/home/lile/microsoft/graphrag/graphrag/index/emit/parquet_table_emitter.py", line 40, in emit
    await self._storage.set(filename, data.to_parquet())
                                      ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/util/_decorators.py", line 333, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/core/frame.py", line 3113, in to_parquet
    return to_parquet(
           ^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/io/parquet.py", line 480, in to_parquet
    impl.write(
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/io/parquet.py", line 190, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 3874, in pyarrow.lib.Table.from_pandas
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 611, in dataframe_to_arrays
    arrays = [convert_column(c, f)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 611, in <listcomp>
    arrays = [convert_column(c, f)
              ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 598, in convert_column
    raise e
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 592, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 339, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('cannot mix struct and non-struct, non-null values', 'Conversion failed for column findings with type object')
21:56:42,381 graphrag.index.reporting.file_workflow_callbacks INFO Error emitting table details=None
21:56:42,677 graphrag.index.run INFO Running workflow: create_final_text_units...
21:56:42,677 graphrag.index.run INFO dependencies for create_final_text_units: ['join_text_units_to_entity_ids', 'create_base_text_units', 'join_text_units_to_relationship_ids']
21:56:42,677 graphrag.index.run INFO read table from storage: join_text_units_to_entity_ids.parquet

xgl0626 commented 1 month ago

I tried modifying the prompt, but it still reports the error. I don't know what the problem is: the bug does not occur on a small dataset, but indexing fails when I switch to a large dataset.

File "/home/lile/microsoft/graphrag/graphrag/index/emit/parquet_table_emitter.py", line 40, in emit await self._storage.set(filename, data.to_parquet())

I'm planning to dump the data to a CSV file at this line of code to see what the problem is.
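A lighter-weight way to inspect the column before writing it anywhere is to count which Python types it actually contains and list the rows that deviate. This is a minimal sketch with made-up data; `summarize_types` and `find_outliers` are hypothetical helper names, and the sample rows only illustrate the kind of mix the `ArrowInvalid` message describes.

```python
from collections import Counter
from typing import Any, Iterable


def summarize_types(values: Iterable[Any]) -> Counter:
    """Count how many values of each Python type appear in a column."""
    return Counter(type(v).__name__ for v in values)


def find_outliers(values: Iterable[Any], expected: type) -> list[tuple[int, Any]]:
    """Return (position, value) pairs that are neither `expected` nor None."""
    return [
        (i, v) for i, v in enumerate(values)
        if v is not None and not isinstance(v, expected)
    ]


# Illustrative data only: two well-formed rows and one that is a bare string.
findings_column = [
    [{"summary": "A", "explanation": "B"}],
    "The model returned prose instead of a list",
    [{"summary": "C", "explanation": "D"}],
]

print(summarize_types(findings_column))
print(find_outliers(findings_column, list))
```

Inside the emitter, the same helpers could be called on `data["findings"]`, since iterating a pandas Series yields its values. On a large dataset this would show whether only a handful of reports came back malformed, which would fit the observation that small datasets index cleanly.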

webZW commented 1 month ago

I added error-catching and fallback logic to the emitter; the code is below. With this change, GraphRAG indexing works fine.

# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""ParquetTableEmitter module."""

import logging
import traceback

import pandas as pd
from pyarrow.lib import ArrowInvalid, ArrowTypeError

from graphrag.index.storage import PipelineStorage
from graphrag.index.typing import ErrorHandlerFn

from .table_emitter import TableEmitter

log = logging.getLogger(__name__)

class ParquetTableEmitter(TableEmitter):
    """ParquetTableEmitter class."""

    _storage: PipelineStorage
    _on_error: ErrorHandlerFn

    def __init__(
        self,
        storage: PipelineStorage,
        on_error: ErrorHandlerFn,
    ):
        """Create a new Parquet Table Emitter."""
        self._storage = storage
        self._on_error = on_error

    async def preprocess_and_emit(self, filename: str, data: pd.DataFrame) -> None:
        """Preprocess data and emit to storage."""
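        def preprocess_findings_column(df):
            def ensure_struct(x):
                if isinstance(x, dict):
                    return x
                # pd.isnull returns an array for list-like input, so only test
                # scalars here; calling .any() on a plain bool raises AttributeError.
                if x is None or (pd.api.types.is_scalar(x) and pd.isnull(x)):
                    return None
                # Wrap every other value (string, list, ...) in a struct
                return {'value': x}

            # Only tables that actually have a findings column need this step
            if 'findings' in df.columns:
                df = df.copy()
                df['findings'] = df['findings'].apply(ensure_struct)
            return df

        # Apply preprocessing
        data = preprocess_findings_column(data)
        await self._storage.set(filename, data.to_parquet())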

    async def emit(self, name: str, data: pd.DataFrame) -> None:
        """Emit a dataframe to storage."""
        filename = f"{name}.parquet"
        log.info("Emitting parquet table %s", filename)

        try:
            await self._storage.set(filename, data.to_parquet())
        except (ArrowTypeError, ArrowInvalid) as e:
            log.warning("Initial parquet save failed, preprocessing data and retrying due to error: %s", str(e))
            try:
                await self.preprocess_and_emit(filename, data)
            except Exception as ex:
                log.exception("Error while emitting parquet table after retry")
                self._on_error(
                    ex,
                    traceback.format_exc(),
                    None,
                )
        except Exception as e:
            log.exception("Unexpected error while emitting parquet table")
            self._on_error(
                e,
                traceback.format_exc(),
                None,
            )

xgl0626 commented 1 month ago

Thanks for the reply. I had also solved the problem earlier by converting the DataFrame to CSV first; I'll try your method.

therealcyberlord commented 1 month ago

Thanks everyone for your insights. Converting the pandas DataFrame to CSV and then to Parquet worked for me. However, I am now getting a new error:

You are trying to merge on int64 and object columns for key 'community'. If you wish to proceed you should use pd.concat

Update: solved the issue by casting column community to string in pandas

LingXuanYin commented 3 weeks ago

Thanks @xgl0626 and @therealcyberlord. I ran into the same problem and changed the code as follows; everything now runs fine.

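try:
    # Round-trip through CSV so object columns such as findings are re-read as plain strings.
    # index=False keeps the DataFrame index from coming back as an extra column.
    data.to_csv('./buf.csv', encoding='UTF-8', index=False)
    data = pd.read_csv('./buf.csv', encoding='UTF-8')
    data['community'] = data['community'].astype(str)

    await self._storage.set(filename, data.to_parquet())
    # buf.csv is a file, so use os.remove (needs `import os`); shutil.rmtree only removes directories
    os.remove('./buf.csv')
except ArrowTypeError as e: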

xxll88 commented 2 weeks ago

Thanks for the help. The create_final_community_reports.parquet file is now created, but local search fails with:

  File "/home/lile/graphrag/graphrag/query/api.py", line 272, in local_search_streaming
    _entities = read_indexer_entities(nodes, entities, community_level)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lile/graphrag/graphrag/query/indexer_adapters.py", line 105, in read_indexer_entities
    entity_df["community"] = entity_df["community"].astype(int)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/core/generic.py", line 6643, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 430, in astype
    return self.apply(
           ^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 363, in apply
    applied = getattr(b, f)(**kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/core/internals/blocks.py", line 758, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/core/dtypes/astype.py", line 237, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/core/dtypes/astype.py", line 182, in astype_array
    values = _astype_nansafe(values, dtype, copy=copy)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/mambaforge/envs/graphrag/lib/python3.11/site-packages/pandas/core/dtypes/astype.py", line 133, in _astype_nansafe
    return arr.astype(dtype, copy=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '4.0'
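
The value '4.0' suggests that the community column was stored as the string form of a float, which Python's `int()` refuses to parse. One plausible cause, not confirmed in this thread, is the CSV round trip followed by the cast to string. The sketch below shows the failure and a tolerant conversion; `to_community_id` is a hypothetical helper, not a GraphRAG function.

```python
def to_community_id(value) -> int:
    """Convert '4', '4.0', 4 or 4.0 to the integer 4.

    Raises ValueError for values with a fractional part, such as '4.5'.
    """
    number = float(value)
    if not number.is_integer():
        raise ValueError(f"not an integer community id: {value!r}")
    return int(number)


# int() accepts '4' but rejects '4.0', which is the error in the traceback.
try:
    int("4.0")
except ValueError as err:
    print("int() fails:", err)

print([to_community_id(v) for v in ["4", "4.0", 4, 4.0]])
```

In pandas terms, a comparable workaround would be to cast through float first, for example `entity_df["community"].astype(float).astype(int)`, assuming the column has no missing values. Whether to patch the reader or to avoid writing '4.0' in the first place is a design choice the thread leaves open.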