Closed goyongi closed 4 months ago
@goyongi Hello! I'm here to assist you with any bugs, questions, or contributions. Let me know how I can help while we wait for a human maintainer.
To address the issue where some file reader classes in llama-index-readers-file
do not accept the fs
argument, you need to modify the load_data
method of those classes to include and handle the fs
argument. Here's how you can update the PandasExcelReader
class to support the fs
argument:
PandasExcelReader
class:
Update the load_data
method to include the fs
argument and handle file reading accordingly.class PandasExcelReader(BaseReader):
r"""Pandas-based Excel parser.
Parses Excel files using the Pandas `read_excel` function.
If special parameters are required, use the `pandas_config` dict.
Args:
concat_rows (bool): whether to concatenate all rows into one document.
If set to False, a Document will be created for each row.
True by default.
sheet_name (str | int | None): Defaults to None, for all sheets, otherwise pass a str or int to specify the sheet to read.
pandas_config (dict): Options for the `pandas.read_excel` function call.
Refer to https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
for more information.
Set to empty dict by default.
"""
def __init__(
self,
*args: Any,
concat_rows: bool = True,
sheet_name=None,
pandas_config: dict = {},
**kwargs: Any
) -> None:
"""Init params."""
super().__init__(*args, **kwargs)
self._concat_rows = concat_rows
self._sheet_name = sheet_name
self._pandas_config = pandas_config
def load_data(
self, file: Path, extra_info: Optional[Dict] = None, fs: Optional[AbstractFileSystem] = None
) -> List[Document]:
"""Parse file."""
openpyxl_spec = importlib.util.find_spec("openpyxl")
if openpyxl_spec is not None:
pass
else:
raise ImportError(
"Please install openpyxl to read Excel files. You can install it with 'pip install openpyxl'"
)
if fs:
with fs.open(file) as f:
dfs = pd.read_excel(f, self._sheet_name, **self._pandas_config)
else:
dfs = pd.read_excel(file, self._sheet_name, **self._pandas_config)
documents = []
# handle the case where only a single DataFrame is returned
if isinstance(dfs, pd.DataFrame):
df = dfs.fillna("")
# Convert DataFrame to list of rows
text_list = (
df.astype(str).apply(lambda row: " ".join(row.values), axis=1).tolist()
)
if self._concat_rows:
documents.append(
Document(text="\n".join(text_list), metadata=extra_info or {})
)
else:
documents.append(
[
Document(text=text, metadata=extra_info or {})
for text in text_list
]
)
else:
for df in dfs.values():
df = df.fillna("")
# Convert DataFrame to list of rows
text_list = (
df.astype(str).apply(lambda row: " ".join(row), axis=1).tolist()
)
if self._concat_rows:
documents.append(
Document(text="\n".join(text_list), metadata=extra_info or {})
)
else:
documents.append(
[
Document(text=text, metadata=extra_info or {})
for text in text_list
]
)
return documents
This modification ensures that the fs
argument is passed to the PandasExcelReader
class and used when opening files, resolving the error when loading .xlsx
files from S3 [1][2].
To continue talking to Dosu, mention @dosu.
Bug Description
When loading files from S3 using
llama-index-readers-s3
,SimpleDirectoryReader
is called with the s3fsfs
argument that is passed down to the respective filetype readers in llama-index-readers-file.Some file reader classes'
load_data
methods don't havefs
as a keyword arg, and will return an error when through the s3 reader. For example, whilePandasCSVReader
works fine with s3fs,PandasExcelReader
gives this error:After adding the
fs
argument to the function and opening the file withfs.open(file)
, the xlsx can be loadedVersion
0.10.51
Steps to Reproduce
While reading from s3 or localstack, use a filetype (like .xlsx) that doesn't support the fs argument
Relevant Logs/Tracbacks
No response