run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
36.69k stars 5.26k forks source link

[Bug]: impossible to use PDfReader with an S3 file because of Path() casting #15406

Open seeifsalem opened 3 months ago

seeifsalem commented 3 months ago

Bug Description

Hey, so to sum it up, I create a SimpleDirectoryReader with a PDFReader as an extractor and an s3 bucket as an input_dir, with also s3 as a fs. ThenI call load_data() which leads to the PDFReader's method load_data(file=input_dir). the input dir is in the s3 format but first thing in load_data are these 2 lines if not isinstance(file, Path): file = Path(file) This transforms it into a WindowsPath instead of a PosixPath and that doesn't work with S3. I think these two lines should be deleted or at least adapted to the filesystem at use PS: I opened this issue with the wrong github account minutes ago, don't mind that one, I closed it

Version

0.10.65

Steps to Reproduce

versions used are the most recent as of today 15.08.2024 for llama-index-readers-file llama-index and llama-index-core input_dir = f"{bucket}/{files_dir}" files_to_exclude = processed_files loader = SimpleDirectoryReader(input_dir=input_dir, exclude=files_to_exclude, file_extractor=extractors, fs=s3fs])

s3fs being an instance of s3fs.core.S3FileSystem

docs = loader.load_data()

Relevant Logs/Tracbacks

The error for one of the files, including additional traceback:

Failed to load file dtc-rag-lab/dtc-internal-test/files/TSLA-Q4-2021-Update.pdf with error: RetryError[<Future at 0x1b42c66de40 state=finished raised ValueError>]. Skipping...
Traceback (most recent call last):
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\tenacity\__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\llama_index\readers\file\docs\base.py", line 56, in load_data        
    with fs.open(str(file), "rb") as fp:
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\fsspec\spec.py", line 1303, in open
    f = self._open(
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\s3fs\core.py", line 688, in _open
    return S3File(
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\s3fs\core.py", line 2155, in __init__
    raise ValueError("Attempt to open non key-like path: %s" % path)
ValueError: Attempt to open non key-like path: dtc-rag-lab\dtc-internal-test\files\TSLA-Q4-2021-Update.pdf

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\llama_index\core\readers\file\base.py", line 540, in load_file       
    docs = reader.load_data(input_file, **kwargs)
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\tenacity\__init__.py", line 336, in wrapped_f
    return copy(f, *args, **kw)
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\tenacity\__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\tenacity\__init__.py", line 376, in iter
    result = action(retry_state)
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\tenacity\__init__.py", line 419, in exc_check
    raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x1b42c66de40 state=finished raised ValueError>]
dosubot[bot] commented 3 months ago

To address the issue with PDFReader when using an S3 file due to Path() casting, you can modify the load_data method to handle S3 paths correctly. Here is a solution using the s3fs library:

  1. Ensure s3fs is installed:

    pip install s3fs
  2. Modify the load_data method in PDFReader: Update the method to handle S3 paths without casting them to Path objects. Here is an example:

    import s3fs
    from PyPDF2 import PdfFileReader
    import io
    
    class PDFReader(BaseReader):
       """PDF parser."""
    
       def __init__(self, return_full_document: Optional[bool] = False) -> None:
           self.return_full_document = return_full_document
    
       def load_data(self, file: str, extra_info: Optional[Dict] = None) -> List[Document]:
           """Parse file."""
           try:
               import pypdf
           except ImportError:
               raise ImportError("pypdf is required to read PDF files: `pip install pypdf`")
    
           if file.startswith("s3://"):
               fs = s3fs.S3FileSystem()
               with fs.open(file, "rb") as fp:
                   pdf = pypdf.PdfReader(fp)
                   num_pages = len(pdf.pages)
                   docs = self._extract_text_from_pdf(pdf, num_pages, fp.name, extra_info)
           else:
               with open(file, "rb") as fp:
                   pdf = pypdf.PdfReader(fp)
                   num_pages = len(pdf.pages)
                   docs = self._extract_text_from_pdf(pdf, num_pages, fp.name, extra_info)
    
           return docs
    
       def _extract_text_from_pdf(self, pdf, num_pages, file_name, extra_info):
           docs = []
           if self.return_full_document:
               text = ""
               metadata = {"file_name": file_name}
               for page in range(num_pages):
                   page_text = pdf.pages[page].extract_text()
                   text += page_text
               if extra_info is not None:
                   metadata.update(extra_info)
               docs.append(Document(text=text, metadata=metadata))
           else:
               for page in range(num_pages):
                   page_text = pdf.pages[page].extract_text()
                   page_label = pdf.page_labels[page]
                   metadata = {"page_label": page_label, "file_name": file_name}
                   if extra_info is not None:
                       metadata.update(extra_info)
                   docs.append(Document(text=page_text, metadata=metadata))
           return docs

This approach ensures that S3 paths are handled correctly without being cast to Path objects, thus avoiding the WindowsPath issue [1].

To continue talking to Dosu, mention @dosu.

Chris-SL-Apps commented 2 months ago

also suffering from this issue... short time fix would be much appreciated

logan-markewich commented 2 months ago

This should already be fixed in latest versions imo?

There is a check _Path = Path if is_default_fs(self.fs) else PurePosixPath

PurePosixPath will not return a path with incorrect slashes

https://github.com/run-llama/llama_index/blob/17f23014953e07eb8f8e7690d4cca7fb26c2109c/llama-index-core/llama_index/core/readers/file/base.py#L249

https://github.com/run-llama/llama_index/blob/17f23014953e07eb8f8e7690d4cca7fb26c2109c/llama-index-core/llama_index/core/readers/file/base.py#L284

seeifsalem commented 2 months ago

The problem is rather here, in the load_file method: https://github.com/run-llama/llama_index/blob/17f23014953e07eb8f8e7690d4cca7fb26c2109c/llama-index-core/llama_index/core/readers/file/base.py#L540 The reader here is a PDFReader instance and this is where it messes up I believe: https://github.com/run-llama/llama_index/blob/17f23014953e07eb8f8e7690d4cca7fb26c2109c/llama-index-integrations/readers/llama-index-readers-file/llama_index/readers/file/docs/base.py#L46-L47 Since this is the last reader function call, it explains why what you sent doesn't work. Up until this line we did have a PosixPath.

wapiflapi commented 1 month ago

Hi !

I'm encountering a similar issue, but with DirFileSystem and not S3.

I'm reading the discussion and I have a question : Why would PurePosixPath be compatible with fsspec ? fsspec seems to take strings, which makes sense given the variety of cases it aims to handle.

For example: I want to use fsspec's DirFileSystem and I'm running into:

  File "[...]/.venv/lib/python3.10/site-packages/fsspec/implementations/dirfs.py", line 61, in _join
    return [self._join(_path) for _path in path]
TypeError: 'PurePosixPath' object is not iterable

They check if it's a string, if not they check if it's a dict, if not they assume it's an itterable like a list. Anyway, doesn't seem to be wrong what they're doing.

I don't understand why llamaindex is passing Path (or PurePosixPath ...) to a library that is not designed to handle them. I don't think using PurePosixPath is a good fix to this issue.

NguyenTuan9347 commented 1 month ago

image I think because the object never get initialize so when you call it result in error ? This way of doing is kind of hacky which is intialize the custom data reader when pass it into the file_extractor arguments image