Open seeifsalem opened 3 months ago
To address the issue with PDFReader
when using an S3 file due to Path()
casting, you can modify the load_data
method to handle S3 paths correctly. Here is a solution using the s3fs
library:
Ensure s3fs
is installed:
pip install s3fs
Modify the load_data
method in PDFReader
:
Update the method to handle S3 paths without casting them to Path
objects. Here is an example:
import s3fs
from PyPDF2 import PdfFileReader
import io
class PDFReader(BaseReader):
"""PDF parser."""
def __init__(self, return_full_document: Optional[bool] = False) -> None:
self.return_full_document = return_full_document
def load_data(self, file: str, extra_info: Optional[Dict] = None) -> List[Document]:
"""Parse file."""
try:
import pypdf
except ImportError:
raise ImportError("pypdf is required to read PDF files: `pip install pypdf`")
if file.startswith("s3://"):
fs = s3fs.S3FileSystem()
with fs.open(file, "rb") as fp:
pdf = pypdf.PdfReader(fp)
num_pages = len(pdf.pages)
docs = self._extract_text_from_pdf(pdf, num_pages, fp.name, extra_info)
else:
with open(file, "rb") as fp:
pdf = pypdf.PdfReader(fp)
num_pages = len(pdf.pages)
docs = self._extract_text_from_pdf(pdf, num_pages, fp.name, extra_info)
return docs
def _extract_text_from_pdf(self, pdf, num_pages, file_name, extra_info):
docs = []
if self.return_full_document:
text = ""
metadata = {"file_name": file_name}
for page in range(num_pages):
page_text = pdf.pages[page].extract_text()
text += page_text
if extra_info is not None:
metadata.update(extra_info)
docs.append(Document(text=text, metadata=metadata))
else:
for page in range(num_pages):
page_text = pdf.pages[page].extract_text()
page_label = pdf.page_labels[page]
metadata = {"page_label": page_label, "file_name": file_name}
if extra_info is not None:
metadata.update(extra_info)
docs.append(Document(text=page_text, metadata=metadata))
return docs
This approach ensures that S3 paths are handled correctly without being cast to Path
objects, thus avoiding the WindowsPath
issue [1].
To continue talking to Dosu, mention @dosu.
also suffering from this issue... short time fix would be much appreciated
This should already be fixed in latest versions imo?
There is a check _Path = Path if is_default_fs(self.fs) else PurePosixPath
PurePosixPath
will not return a path with incorrect slashes
The problem is rather here, in the load_file method: https://github.com/run-llama/llama_index/blob/17f23014953e07eb8f8e7690d4cca7fb26c2109c/llama-index-core/llama_index/core/readers/file/base.py#L540 The reader here is a PDFReader instance and this is where it messes up I believe: https://github.com/run-llama/llama_index/blob/17f23014953e07eb8f8e7690d4cca7fb26c2109c/llama-index-integrations/readers/llama-index-readers-file/llama_index/readers/file/docs/base.py#L46-L47 Since this is the last reader function call, it explains why what you sent doesn't work. Up until this line we did have a PosixPath.
Hi !
I'm encountering a similar issue, but with DirFileSystem
and not S3.
I'm reading the discussion and I have a question : Why would PurePosixPath
be compatible with fsspec
? fsspec seems to take strings, which makes sense given the variety of cases it aims to handle.
For example: I want to use fsspec's DirFileSystem
and I'm running into:
File "[...]/.venv/lib/python3.10/site-packages/fsspec/implementations/dirfs.py", line 61, in _join
return [self._join(_path) for _path in path]
TypeError: 'PurePosixPath' object is not iterable
They check if it's a string, if not they check if it's a dict, if not they assume it's an itterable like a list. Anyway, doesn't seem to be wrong what they're doing.
I don't understand why llamaindex is passing Path
(or PurePosixPath
...) to a library that is not designed to handle them. I don't think using PurePosixPath
is a good fix to this issue.
I think because the object never get initialize so when you call it result in error ? This way of doing is kind of hacky which is intialize the custom data reader when pass it into the file_extractor arguments
Bug Description
Hey, so to sum it up, I create a SimpleDirectoryReader with a PDFReader as an extractor and an s3 bucket as an input_dir, with also s3 as a fs. ThenI call load_data() which leads to the PDFReader's method load_data(file=input_dir). the input dir is in the s3 format but first thing in load_data are these 2 lines if not isinstance(file, Path): file = Path(file) This transforms it into a WindowsPath instead of a PosixPath and that doesn't work with S3. I think these two lines should be deleted or at least adapted to the filesystem at use PS: I opened this issue with the wrong github account minutes ago, don't mind that one, I closed it
Version
0.10.65
Steps to Reproduce
versions used are the most recent as of today 15.08.2024 for llama-index-readers-file llama-index and llama-index-core input_dir = f"{bucket}/{files_dir}" files_to_exclude = processed_files loader = SimpleDirectoryReader(input_dir=input_dir, exclude=files_to_exclude, file_extractor=extractors, fs=s3fs])
s3fs being an instance of s3fs.core.S3FileSystem
docs = loader.load_data()
Relevant Logs/Tracbacks