run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
[Bug]: impossible to use PDFReader with an S3 file because of Path() casting #15405

Closed Chris-SL-Apps closed 3 months ago

Chris-SL-Apps commented 3 months ago

Bug Description

Hey, so to sum it up: I create a SimpleDirectoryReader with a PDFReader as an extractor, an S3 bucket path as input_dir, and an s3fs filesystem as fs. Then I call load_data(), which leads to the PDFReader's load_data(file=...) being called with a path under input_dir. That path is in S3 format (forward slashes), but the first thing load_data does is these two lines:

if not isinstance(file, Path):
    file = Path(file)

On Windows this turns the path into a WindowsPath instead of a PosixPath, so its string form uses backslashes, and that doesn't work with S3. I think these two lines should be deleted, or at least adapted to the filesystem in use.
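
The effect described above can be reproduced on any OS with pathlib's pure path classes. This is only a minimal sketch: the key is copied from the log in this report, and PureWindowsPath stands in for what Path() returns on a Windows machine.

```python
from pathlib import PurePosixPath, PureWindowsPath

# S3-style key taken from the log in this report.
key = "dtc-rag-lab/dtc-internal-test/files/TSLA-Q4-2021-Update.pdf"

# On Windows, Path(key) is a WindowsPath. PureWindowsPath has the same
# string behavior and can be instantiated on any platform.
windows_str = str(PureWindowsPath(key))

# A POSIX-flavored path keeps the forward slashes.
posix_str = str(PurePosixPath(key))

# as_posix() renders any pure path with forward slashes.
round_tripped = PureWindowsPath(key).as_posix()

print(windows_str)
print(posix_str)
print(round_tripped)
```

The first printed string uses backslashes, which matches the path shown in the ValueError, while the other two keep the original key.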

Version

0.10.65

Steps to Reproduce

The versions used are the most recent as of today (15.08.2024) for llama-index-readers-file, llama-index, and llama-index-core.

input_dir = f"{bucket}/{files_dir}"
files_to_exclude = processed_files
loader = SimpleDirectoryReader(input_dir=input_dir, exclude=files_to_exclude, file_extractor=extractors, fs=s3fs)

s3fs being an instance of s3fs.core.S3FileSystem

docs = loader.load_data()

Relevant Logs/Tracebacks

The error for one of the files, including additional traceback:

Failed to load file dtc-rag-lab/dtc-internal-test/files/TSLA-Q4-2021-Update.pdf with error: RetryError[<Future at 0x1b42c66de40 state=finished raised ValueError>]. Skipping...
Traceback (most recent call last):
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\tenacity\__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\llama_index\readers\file\docs\base.py", line 56, in load_data        
    with fs.open(str(file), "rb") as fp:
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\fsspec\spec.py", line 1303, in open
    f = self._open(
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\s3fs\core.py", line 688, in _open
    return S3File(
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\s3fs\core.py", line 2155, in __init__
    raise ValueError("Attempt to open non key-like path: %s" % path)
ValueError: Attempt to open non key-like path: dtc-rag-lab\dtc-internal-test\files\TSLA-Q4-2021-Update.pdf

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\llama_index\core\readers\file\base.py", line 540, in load_file       
    docs = reader.load_data(input_file, **kwargs)
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\tenacity\__init__.py", line 336, in wrapped_f
    return copy(f, *args, **kw)
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\tenacity\__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\tenacity\__init__.py", line 376, in iter
    result = action(retry_state)
  File "c:\Users\VM-User\Desktop\RagLab\dtc_rag_lab\venv\lib\site-packages\tenacity\__init__.py", line 419, in exc_check
    raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x1b42c66de40 state=finished raised ValueError>]
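
The "non key-like path" message seems consistent with the backslash-separated string having no "/" in it, so nothing after the bucket name can be read as a key. The sketch below illustrates that idea and one possible workaround. Both split_bucket_key and to_remote_path are hypothetical helpers written for this illustration; they are not s3fs or llama-index code, and the split is a deliberately simplified stand-in.

```python
from pathlib import PurePath, PureWindowsPath
from typing import Tuple


def split_bucket_key(path: str) -> Tuple[str, str]:
    """Simplified stand-in for an S3 path split (not the s3fs implementation).

    Everything before the first "/" is the bucket, the rest is the key.
    """
    if path.startswith("s3://"):
        path = path[len("s3://"):]
    bucket, _, key = path.lstrip("/").partition("/")
    return bucket, key


def to_remote_path(file) -> str:
    """Hypothetical helper: render a path with forward slashes on any OS."""
    if isinstance(file, PurePath):
        return file.as_posix()
    return str(file).replace("\\", "/")


original = "dtc-rag-lab/dtc-internal-test/files/TSLA-Q4-2021-Update.pdf"

# What str(Path(original)) yields on Windows: backslashes, so no key is found.
broken = str(PureWindowsPath(original))
broken_bucket, broken_key = split_bucket_key(broken)

# Normalizing to forward slashes restores a usable bucket/key pair.
fixed = to_remote_path(PureWindowsPath(original))
fixed_bucket, fixed_key = split_bucket_key(fixed)

print(repr(broken_key))
print(fixed_bucket, fixed_key)
```

Under these assumptions, passing the normalized string (rather than str() of a Windows path) to fs.open would keep the key intact.
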
Chris-SL-Apps commented 3 months ago

Oops, sorry, never mind this.