Bug Description
Hey, so to sum it up: I create a SimpleDirectoryReader with a PDFReader as the file extractor, an S3 bucket path as input_dir, and an S3 filesystem as fs.
Then I call load_data(), which leads to the PDFReader's load_data(file=...) method. The path passed in is in S3 format, but the first thing load_data does is run these two lines:
if not isinstance(file, Path):
    file = Path(file)
On Windows this turns the path into a WindowsPath instead of a PosixPath, so the forward slashes become backslashes, and that doesn't work with S3. I think these two lines should be deleted, or at least adapted to the filesystem in use.
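The separator change can be reproduced without S3 or llama-index. This is a minimal sketch using only the standard library; the key below is a made-up placeholder, and PureWindowsPath stands in for what Path(...) produces on a Windows machine, so the snippet behaves the same on any OS.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical S3 key in the same "bucket/prefix/file" shape as in this report.
key = "my-bucket/some-prefix/report.pdf"

# PureWindowsPath mimics the result of Path(key) on Windows.
windows_str = str(PureWindowsPath(key))

# PurePosixPath mimics the result of Path(key) on Linux/macOS.
posix_str = str(PurePosixPath(key))

# as_posix() renders the Windows path with forward slashes again.
restored = PureWindowsPath(key).as_posix()

print(windows_str)  # backslash-separated, not a valid S3 key
print(posix_str)    # unchanged
print(restored)     # unchanged
```

Only the Windows flavour rewrites the separators, which matches the backslash path reported in the ValueError.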
Version
0.10.65
Steps to Reproduce
The versions of llama-index, llama-index-core and llama-index-readers-file are the most recent as of today, 15.08.2024.
input_dir = f"{bucket}/{files_dir}"
files_to_exclude = processed_files
# s3fs is an instance of s3fs.core.S3FileSystem
loader = SimpleDirectoryReader(input_dir=input_dir, exclude=files_to_exclude, file_extractor=extractors, fs=s3fs)
docs = loader.load_data()
Relevant Logs/Tracebacks
The error for one of the files, including additional traceback:
Failed to load file dtc-rag-lab/dtc-internal-test/files/TSLA-Q4-2021-Update.pdf with error: RetryError[<Future at 0x1b42c66de40 state=finished raised ValueError>]. Skipping...
Traceback (most recent call last):
File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\tenacity\__init__.py", line 478, in __call__
result = fn(*args, **kwargs)
File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\llama_index\readers\file\docs\base.py", line 56, in load_data
with fs.open(str(file), "rb") as fp:
File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\fsspec\spec.py", line 1303, in open
f = self._open(
File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\s3fs\core.py", line 688, in _open
return S3File(
File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\s3fs\core.py", line 2155, in __init__
raise ValueError("Attempt to open non key-like path: %s" % path)
ValueError: Attempt to open non key-like path: dtc-rag-lab\dtc-internal-test\files\TSLA-Q4-2021-Update.pdf
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\llama_index\core\readers\file\base.py", line 540, in load_file
docs = reader.load_data(input_file, **kwargs)
File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\tenacity\__init__.py", line 336, in wrapped_f
return copy(f, *args, **kw)
File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\tenacity\__init__.py", line 475, in __call__
do = self.iter(retry_state=retry_state)
File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\tenacity\__init__.py", line 376, in iter
result = action(retry_state)
File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\tenacity\__init__.py", line 419, in exc_check
raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x1b42c66de40 state=finished raised ValueError>]
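One possible direction for a fix, sketched under assumptions: instead of passing str(file) to fs.open, the reader could normalise the path to forward slashes first. The helper name to_fs_key is hypothetical and is not part of llama-index, fsspec or s3fs; it only illustrates the idea.

```python
from pathlib import PurePath, PureWindowsPath
from typing import Union


def to_fs_key(file: Union[str, PurePath]) -> str:
    """Hypothetical helper: return a forward-slash path string for fs.open()."""
    if isinstance(file, PurePath):
        # as_posix() renders the path with "/" separators on every platform.
        return file.as_posix()
    # Plain strings are passed through untouched so remote keys stay intact.
    return str(file)


# Simulate the path object the reader ends up with on Windows.
windows_path = PureWindowsPath("my-bucket/some-prefix/report.pdf")
print(to_fs_key(windows_path))

# Assumed call site (not executed here): fs.open(to_fs_key(file), "rb")
```

For local Windows files, as_posix() yields paths such as C:/Users/..., so this sketch should not break the local filesystem case, though I have not tested it against the actual reader.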