run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
36.78k stars 5.27k forks source link

[Bug]: Not all `llama_index.readers.file` Classes use fs argument when calling load_data #14529

Closed goyongi closed 4 months ago

goyongi commented 4 months ago

Bug Description

When loading files from S3 using llama-index-readers-s3, SimpleDirectoryReader is called with the s3fs fs argument that is passed down to the respective filetype readers in llama-index-readers-file.

Some file reader classes' load_data methods don't have fs as a keyword arg, and will return an error when through the s3 reader. For example, while PandasCSVReader works fine with s3fs, PandasExcelReader gives this error:

Failed to load file bucket/object.xlsx with error: PandasExcelReader.load_data() got an unexpected keyword argument 'fs'. Skipping...

After adding the fs argument to the function and opening the file with fs.open(file), the xlsx can be loaded

Version

0.10.51

Steps to Reproduce

While reading from s3 or localstack, use a filetype (like .xlsx) that doesn't support the fs argument

Relevant Logs/Tracbacks

No response

dosubot[bot] commented 4 months ago

@goyongi Hello! I'm here to assist you with any bugs, questions, or contributions. Let me know how I can help while we wait for a human maintainer.

To address the issue where some file reader classes in llama-index-readers-file do not accept the fs argument, you need to modify the load_data method of those classes to include and handle the fs argument. Here's how you can update the PandasExcelReader class to support the fs argument:

  1. Modify the PandasExcelReader class: Update the load_data method to include the fs argument and handle file reading accordingly.
class PandasExcelReader(BaseReader):
    r"""Pandas-based Excel parser.

    Parses Excel files using the Pandas `read_excel` function.
    If special parameters are required, use the `pandas_config` dict.

    Args:
        concat_rows (bool): whether to concatenate all rows into one document.
            If set to False, a Document will be created for each row.
            True by default.

        sheet_name (str | int | None): Defaults to None, for all sheets, otherwise pass a str or int to specify the sheet to read.

        pandas_config (dict): Options for the `pandas.read_excel` function call.
            Refer to https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
            for more information.
            Set to empty dict by default.

    """

    def __init__(
        self,
        *args: Any,
        concat_rows: bool = True,
        sheet_name=None,
        pandas_config: dict = {},
        **kwargs: Any
    ) -> None:
        """Init params."""
        super().__init__(*args, **kwargs)
        self._concat_rows = concat_rows
        self._sheet_name = sheet_name
        self._pandas_config = pandas_config

    def load_data(
        self, file: Path, extra_info: Optional[Dict] = None, fs: Optional[AbstractFileSystem] = None
    ) -> List[Document]:
        """Parse file."""
        openpyxl_spec = importlib.util.find_spec("openpyxl")
        if openpyxl_spec is not None:
            pass
        else:
            raise ImportError(
                "Please install openpyxl to read Excel files. You can install it with 'pip install openpyxl'"
            )

        if fs:
            with fs.open(file) as f:
                dfs = pd.read_excel(f, self._sheet_name, **self._pandas_config)
        else:
            dfs = pd.read_excel(file, self._sheet_name, **self._pandas_config)

        documents = []

        # handle the case where only a single DataFrame is returned
        if isinstance(dfs, pd.DataFrame):
            df = dfs.fillna("")

            # Convert DataFrame to list of rows
            text_list = (
                df.astype(str).apply(lambda row: " ".join(row.values), axis=1).tolist()
            )

            if self._concat_rows:
                documents.append(
                    Document(text="\n".join(text_list), metadata=extra_info or {})
                )
            else:
                documents.append(
                    [
                        Document(text=text, metadata=extra_info or {})
                        for text in text_list
                    ]
                )
        else:
            for df in dfs.values():
                df = df.fillna("")

                # Convert DataFrame to list of rows
                text_list = (
                    df.astype(str).apply(lambda row: " ".join(row), axis=1).tolist()
                )

                if self._concat_rows:
                    documents.append(
                        Document(text="\n".join(text_list), metadata=extra_info or {})
                    )
                else:
                    documents.append(
                        [
                            Document(text=text, metadata=extra_info or {})
                            for text in text_list
                        ]
                    )

        return documents

This modification ensures that the fs argument is passed to the PandasExcelReader class and used when opening files, resolving the error when loading .xlsx files from S3 [1][2].

To continue talking to Dosu, mention @dosu.