Open Anitha6g opened 6 days ago
Hey @Anitha6g! 👋 I'm here to help you with any bugs, questions, or contributions you have. Let's squash those bugs together!
I found a similar open issue that might be related to the problem you're experiencing: [Bug]: Failed to load file using s3Reader and SimpleDirectoryReader with aws s3. This issue suggests ensuring that the `SimpleDirectoryReader` is correctly configured with the necessary parameters. It also provides an alternative solution using the `BotoMinioReader` class, which is specifically designed to handle S3 file and directory reading using `boto3` [1].
In your case, it seems like the `S3Reader` is having trouble with non-text files like PDF and DOCX. Here are a few things you might want to check or try:
File Paths: Ensure that the file paths in your S3 bucket are correctly specified and accessible. The error message "Attempt to open non key-like path" suggests there might be an issue with how the paths are being interpreted.
RetryError: This could be related to network issues or incorrect configurations. Double-check your AWS credentials and network settings.
Dependencies: Make sure that the necessary libraries for reading PDF and DOCX files are installed. The `PDFReader` requires `pypdf`, and the `DocxReader` requires `docx2txt` [2].
File Extractor Configuration: Verify that the `file_extractor` is correctly mapping the file extensions to the appropriate reader classes.
If these suggestions don't resolve the issue, you might want to consider using the `BotoMinioReader` as an alternative approach, as mentioned in the similar issue.
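The dependency check above can be done quickly from Python. This is a minimal sketch, assuming the import names match the package names mentioned above (`pypdf`, `docx2txt`); `missing_modules` is a hypothetical helper, not part of LlamaIndex.

```python
from importlib.util import find_spec

# Assumption: the importable module names equal the package names cited above.
REQUIRED_MODULES = {".pdf": "pypdf", ".docx": "docx2txt"}


def missing_modules(required=REQUIRED_MODULES):
    """Return {extension: module} for every module that cannot be located."""
    return {ext: mod for ext, mod in required.items() if find_spec(mod) is None}


if __name__ == "__main__":
    # Print an install hint for each file type whose parser is not installed.
    for ext, mod in missing_modules().items():
        print(f"{ext}: module '{mod}' not found, try: pip install {mod}")
```

An empty result only shows that the modules can be found; it says nothing about whether the readers can open a given S3 path.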
To continue talking to Dosu, mention @dosu.
@dosu: I re-checked the file paths, credentials, and dependencies, and everything seems right. From my code, can you please let me know whether the file_extractor is correctly mapped?
The `file_extractor` in your code is correctly mapping the file extensions to the appropriate reader classes for PDF and DOCX files. The mapping is as follows:
file_extractor={".pdf": PDFReader(), ".docx": DocxReader()}
This indicates that PDF files (with the `.pdf` extension) are mapped to the `PDFReader` class, and DOCX files (with the `.docx` extension) are mapped to the `DocxReader` class [1].
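To make the idea of an extension-to-reader mapping concrete, here is a small sketch of how such a lookup could work. It is an illustration only, not LlamaIndex's actual dispatch code: `pick_extractor` is a hypothetical helper, and plain strings stand in for the `PDFReader()` and `DocxReader()` instances.

```python
from pathlib import PurePosixPath


def pick_extractor(key, file_extractor, default=None):
    """Look up the reader registered for a key's extension (case-insensitive)."""
    suffix = PurePosixPath(key).suffix.lower()
    return file_extractor.get(suffix, default)


# Placeholder strings stand in for real reader instances.
file_extractor = {".pdf": "PDFReader", ".docx": "DocxReader"}

print(pick_extractor("test-llamaindex/file2.pdf", file_extractor))  # PDFReader
print(pick_extractor("test-llamaindex/notes.txt", file_extractor))  # None
```

A correct mapping only decides which reader is called; it does not rule out a failure inside the reader itself.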
@Anitha6g is it a password protected PDF? Seems like the PDFReader itself can't load it
Could try using LlamaParse instead of PDFReader
@logan-markewich : No, the file is not password protected
@logan-markewich : I tried using LlamaParse as per your suggestion, with a different PDF file, but an error still occurs - Failed to load file test-llamaindex/anitha-version-FCCS.pdf with error: The input file_path must be a string or a list of strings.. Skipping...
This is my latest code:
from llama_index.readers.s3 import S3Reader
from llama_index.readers.file import PDFReader, DocxReader
from dotenv import load_dotenv
load_dotenv()
from llama_parse import LlamaParse
parser = LlamaParse(
result_type="text" # "markdown" and "text" are available
)
file_extractor = {".pdf": parser}
reader = S3Reader(
bucket="test-llamaindex",
aws_access_id=AWS_ACCESS_KEY_ID,
aws_access_secret=AWS_SECRET_ACCESS_KEY,
aws_session_token=AWS_SESSION_TOKEN,
recursive=True,
key="anitha-version-FCCS.pdf",
required_exts=[".pdf"],
file_extractor=file_extractor
)
documents = reader.load_data()
documents
Appreciate your help on this.
FYI: Same error with docx as well - Failed to load file test-llamaindex/filename.docx with error: The input file_path must be a string or a list of strings.. Skipping...
@logan-markewich ISSUE: The problem is in the PDFReader class, in this condition:
"""Parse file."""
if not isinstance(file, Path):
    file = Path(file)
My file was of type <class 'pathlib.PurePosixPath'>, which is not an instance of Path, so the condition was true and Path(file) ran. On Windows, Path(file) changes the "/" in the file path to "\", which s3fs does not support, and this gives the error - Attempt to open non key-like path: test-llamaindex\ShortStartStopMove.pdf - at the line with fs.open(str(file), "rb") as fp:
SOLUTION: Convert the file to a POSIX path instead.
"""Parse file."""
if not isinstance(file, Path) and not isinstance(file, PurePosixPath):
    file = PurePosixPath(file)
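The path conversion described here can be reproduced on any operating system with the standard library alone. The sketch below uses `PureWindowsPath` to stand in for what `Path(...)` builds on Windows; `to_s3_key` is a hypothetical helper shown only to illustrate the idea, not the fix that was merged into LlamaIndex.

```python
from pathlib import Path, PurePath, PurePosixPath, PureWindowsPath

s3_key = PurePosixPath("test-llamaindex/ShortStartStopMove.pdf")

# PurePosixPath is not a subclass of Path, so `not isinstance(file, Path)` is True.
print(isinstance(s3_key, Path))  # False

# On Windows, Path(...) yields a WindowsPath; PureWindowsPath mimics that anywhere.
print(str(PureWindowsPath(s3_key)))  # test-llamaindex\ShortStartStopMove.pdf


def to_s3_key(file):
    """Return a forward-slash key string for a str or any pathlib path object."""
    if isinstance(file, PurePath):
        return file.as_posix()
    # Plain strings are taken as-is; backslashes inside a str are not rewritten.
    return PurePosixPath(file).as_posix()


print(to_s3_key(s3_key))  # test-llamaindex/ShortStartStopMove.pdf
```

Calling `as_posix()` on the path object keeps forward slashes regardless of platform, which is the property an S3 key needs.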
@dosu: Please let me know the steps to create a PR to fix this.
The issue you're encountering is due to the `PDFReader` class not handling `PurePosixPath` correctly, which results in incorrect path formatting for S3. Your solution to convert the file into a `PurePosixPath` is a good approach to maintain the correct path format.
To create a pull request (PR) to fix this issue, follow these general steps:
Fork the Repository: Go to the LlamaIndex repository on GitHub and fork it to your own GitHub account.
Clone Your Fork: Clone the forked repository to your local machine using:
git clone https://github.com/YOUR_USERNAME/llama_index.git
Create a New Branch: Navigate into the cloned repository and create a new branch for your fix:
cd llama_index
git checkout -b fix-pureposixpath-handling
Make Your Changes: Open the relevant file in your code editor and apply your fix. Ensure that the `PDFReader` and any other relevant classes handle `PurePosixPath` correctly.
Commit Your Changes: After making the changes, commit them with a descriptive message:
git add .
git commit -m "Fix handling of PurePosixPath in PDFReader"
Push Your Changes: Push the changes to your forked repository:
git push origin fix-pureposixpath-handling
Create a Pull Request: Go to the original LlamaIndex repository on GitHub. You should see a prompt to create a pull request from your new branch. Click on it and provide a clear description of the changes and why they are necessary.
Submit the PR: Once you've filled out the PR details, submit it for review.
Make sure to follow any contribution guidelines provided by the repository to ensure your PR is reviewed and accepted smoothly.
Bug Description
I'm trying to read and load txt, pdf and docx files from an S3 bucket using S3Reader, but I get these errors for docx and pdf:
Failed to load file test-llamaindex/file1.docx with error: Attempt to open non key-like path: test-llamaindex\file1.docx. Skipping...
Failed to load file test-llamaindex/file2.pdf with error: RetryError[<Future at 0x1122b0b42b0 state=finished raised ValueError>]. Skipping...
Version
llama_index Version: 0.11.18, llama-index-readers-s3 Version: 0.2.0
Steps to Reproduce
This is the code I'm using
`from llama_index.readers.s3 import S3Reader
from llama_index.readers.file import PDFReader, DocxReader

reader = S3Reader(
    bucket="test-llamaindex",
    aws_access_id=AWS_ACCESS_KEY_ID,
    aws_access_secret=AWS_SECRET_ACCESS_KEY,
    aws_session_token=AWS_SESSION_TOKEN,
    recursive=True,
    required_exts=[".pdf", ".docx", ".txt"],
    file_extractor={".pdf": PDFReader(), ".docx": DocxReader()}
)
documents = reader.load_data()
documents`
Relevant Logs/Tracebacks
No response