run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: S3Reader: Failed to load files other than .txt using S3Reader #16602

Open Anitha6g opened 6 days ago

Anitha6g commented 6 days ago

Bug Description

I'm trying to read and load txt, pdf and docx files from an S3 bucket using S3Reader, but I get these errors for the docx and pdf files:

    Failed to load file test-llamaindex/file1.docx with error: Attempt to open non key-like path: test-llamaindex\file1.docx. Skipping...
    Failed to load file test-llamaindex/file2.pdf with error: RetryError[<Future at 0x1122b0b42b0 state=finished raised ValueError>]. Skipping...

Version

llama_index Version: 0.11.18, llama-index-readers-s3 Version: 0.2.0

Steps to Reproduce

This is the code I'm using:

    from llama_index.readers.s3 import S3Reader
    from llama_index.readers.file import PDFReader, DocxReader

    reader = S3Reader(
        bucket="test-llamaindex",
        aws_access_id=AWS_ACCESS_KEY_ID,
        aws_access_secret=AWS_SECRET_ACCESS_KEY,
        aws_session_token=AWS_SESSION_TOKEN,
        recursive=True,
        required_exts=[".pdf", ".docx", ".txt"],
        file_extractor={".pdf": PDFReader(), ".docx": DocxReader()},
    )
    documents = reader.load_data()
    documents

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 6 days ago

Hey @Anitha6g! 👋 I'm here to help you with any bugs, questions, or contributions you have. Let's squash those bugs together!

I found a similar open issue that might be related to the problem you're experiencing: [Bug]: Failed to load file using s3Reader and SimpleDirectoryReader with aws s3. This issue suggests ensuring that the SimpleDirectoryReader is correctly configured with the necessary parameters. It also provides an alternative solution using the BotoMinioReader class, which is specifically designed to handle S3 file and directory reading using boto3 [1].

In your case, it seems like the S3Reader is having trouble with non-text files like PDF and DOCX. Here are a few things you might want to check or try:

  1. File Paths: Ensure that the file paths in your S3 bucket are correctly specified and accessible. The error message "Attempt to open non key-like path" suggests there might be an issue with how the paths are being interpreted.

  2. RetryError: This could be related to network issues or incorrect configurations. Double-check your AWS credentials and network settings.

  3. Dependencies: Make sure that the necessary libraries for reading PDF and DOCX files are installed. The PDFReader requires pypdf, and the DocxReader requires docx2txt [2].

  4. File Extractor Configuration: Verify that the file_extractor is correctly mapping the file extensions to the appropriate reader classes.

If these suggestions don't resolve the issue, you might want to consider using the BotoMinioReader as an alternative approach, as mentioned in the similar issue.
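The dependency check in item 3 can be done from Python before touching S3 at all. This is a minimal sketch, assuming the import names match the package names given above (`pypdf` and `docx2txt`); `missing_modules` is a hypothetical helper, not part of LlamaIndex.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# An empty list means both parser dependencies are importable.
print(missing_modules(["pypdf", "docx2txt"]))
```

If the list is not empty, installing the named packages is the first thing to try. An empty list rules out this cause without loading any files.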

To continue talking to Dosu, mention @dosu.

Anitha6g commented 6 days ago

@dosu: I re-checked the file paths, credentials and dependencies, and everything seems right. From my code, can you please tell me whether the file_extractor is correctly mapped?

dosubot[bot] commented 6 days ago

The file_extractor in your code is correctly mapping the file extensions to the appropriate reader classes for PDF and DOCX files. The mapping is as follows:

file_extractor={".pdf": PDFReader(), ".docx": DocxReader()}

This indicates that PDF files (with the .pdf extension) are mapped to the PDFReader class, and DOCX files (with the .docx extension) are mapped to the DocxReader class [1].


logan-markewich commented 6 days ago

@Anitha6g is it a password protected PDF? Seems like the PDFReader itself can't load it

logan-markewich commented 6 days ago

Could try using LlamaParse instead of PDFReader

Anitha6g commented 6 days ago

@logan-markewich : No, the file is not password protected

Anitha6g commented 6 days ago

@logan-markewich : I tried LlamaParse as you suggested, with a different PDF file, but loading still fails:

    Failed to load file test-llamaindex/anitha-version-FCCS.pdf with error: The input file_path must be a string or a list of strings.. Skipping...

This is my latest code:

    from llama_index.readers.s3 import S3Reader
    from llama_index.readers.file import PDFReader, DocxReader

    from dotenv import load_dotenv
    load_dotenv()

    # bring in deps
    from llama_parse import LlamaParse

    # set up parser
    parser = LlamaParse(
        result_type="text"  # "markdown" and "text" are available
    )
    file_extractor = {".pdf": parser}
    reader = S3Reader(
        bucket="test-llamaindex",
        aws_access_id=AWS_ACCESS_KEY_ID,
        aws_access_secret=AWS_SECRET_ACCESS_KEY,
        aws_session_token=AWS_SESSION_TOKEN,
        recursive=True,
        key="anitha-version-FCCS.pdf",
        required_exts=[".pdf"],
        file_extractor=file_extractor,
    )
    documents = reader.load_data()
    documents

Appreciate your help on this.

FYI: The same error occurs with docx files as well:

    Failed to load file test-llamaindex/filename.docx with error: The input file_path must be a string or a list of strings.. Skipping...

Anitha6g commented 6 days ago

@logan-markewich ISSUE: The issue is in the PDFReader class, in the condition below:

    """Parse file."""
    if not isinstance(file, Path):
        file = Path(file)

Since my file was of type <class 'pathlib.PurePosixPath'>, the if condition passed, and Path(file) changed the "/" in the file path to "\", which s3fs does not support. This gives the error "Attempt to open non key-like path: test-llamaindex\ShortStartStopMove.pdf" at the line:

    with fs.open(str(file), "rb") as fp:

SOLUTION: Convert the file to a POSIX path:

    """Parse file."""
    if not isinstance(file, Path) and not isinstance(file, PurePosixPath):
        file = PurePosixPath(file)

@dosu: Please let me know the steps to create a PR to fix this.
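The path behaviour described above can be reproduced on any operating system with only the standard library. The sketch below uses `PureWindowsPath` as a stand-in for what `Path(file)` produces on Windows. `to_s3_key` is a hypothetical helper for illustration, not the fix that was merged into LlamaIndex, and the bucket and file names are taken from the error messages in this thread.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# The thread reports that the reader receives a PurePosixPath. That class is
# not a subclass of Path, so the `not isinstance(file, Path)` guard is true.
key = PurePosixPath("test-llamaindex/file1.docx")
guard_fires = not isinstance(key, Path)

# PureWindowsPath stands in for the result of Path(file) on Windows:
# its string form uses "\" separators, which s3fs rejects as a key.
windows_style = str(PureWindowsPath(str(key)))


def to_s3_key(path) -> str:
    """Hypothetical helper: return a forward-slash key suitable for s3fs."""
    if isinstance(path, PurePosixPath):
        return str(path)  # already uses "/" separators
    # as_posix() turns Windows "\" separators back into "/".
    return PureWindowsPath(str(path)).as_posix()


print(guard_fires)
print(windows_style)
print(to_s3_key(windows_style))
```

One caveat with this approach: an S3 key that legitimately contains a backslash would be rewritten, so a real fix may prefer to leave `PurePosixPath` inputs untouched, as the proposed condition does.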

dosubot[bot] commented 6 days ago

The issue you're encountering is due to the PDFReader class not handling PurePosixPath correctly, which results in incorrect path formatting for S3. Your solution to convert the file into a PurePosixPath is a good approach to maintain the correct path format.

To create a pull request (PR) to fix this issue, follow these general steps:

  1. Fork the Repository: Go to the LlamaIndex repository on GitHub and fork it to your own GitHub account.

  2. Clone Your Fork: Clone the forked repository to your local machine using:

    git clone https://github.com/YOUR_USERNAME/llama_index.git

  3. Create a New Branch: Navigate into the cloned repository and create a new branch for your fix:

    cd llama_index
    git checkout -b fix-pureposixpath-handling

  4. Make Your Changes: Open the relevant file in your code editor and apply your fix. Ensure that the PDFReader and any other relevant classes handle PurePosixPath correctly.

  5. Commit Your Changes: After making the changes, commit them with a descriptive message:

    git add .
    git commit -m "Fix handling of PurePosixPath in PDFReader"

  6. Push Your Changes: Push the changes to your forked repository:

    git push origin fix-pureposixpath-handling

  7. Create a Pull Request: Go to the original LlamaIndex repository on GitHub. You should see a prompt to create a pull request from your new branch. Click on it and provide a clear description of the changes and why they are necessary.

  8. Submit the PR: Once you've filled out the PR details, submit it for review.

Make sure to follow any contribution guidelines provided by the repository to ensure your PR is reviewed and accepted smoothly.
