run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Failed to load file using s3Reader and SimpleDirectoryReader with aws s3 #14792

Open MLai0519 opened 2 months ago

MLai0519 commented 2 months ago

Bug Description

Failed to load files with either reader. I tested the connection with boto3 and with s3fs alone, and both can reach the S3 bucket.

Version

0.10.54

Steps to Reproduce

I am using the code below, with llama-index-readers-s3 = 0.1.10 and s3fs = 2024.6.1. `bucket` is the target bucket and `folder` is the subdirectory (prefix). With the same key and secret, I can use a boto3 S3 client to upload to and download from the bucket.

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.s3 import S3Reader
from s3fs import S3FileSystem

# s3_fs is built with the same key/secret that work via boto3
s3_fs = S3FileSystem(
    key=getawsaccess('uat')[0],
    secret=getawsaccess('uat')[1],
)
s3_fs.invalidate_cache()

def readwithS3Loader():
    loader = S3Reader(
        bucket=bucket,
        prefix=folder,
        aws_access_id=getawsaccess('uat')[0],
        aws_access_secret=getawsaccess('uat')[1],
    )
    all_docs = loader.load_data()
    print(all_docs)
    return all_docs

def readfilefroms3():
    reader = SimpleDirectoryReader(
        input_dir=bucket + '/' + folder,
        fs=s3_fs,
        filename_as_id=True,
    )
    all_docs = reader.load_data()
    print(all_docs)
    return all_docs
```

Relevant Logs/Tracebacks

```shell
# with bucket = 'testing-bucket'
Failed to load file testing-bucket/{folder}/file.docx with error: Attempt to open non key-like path: testing-bucket/{folder}/file.docx. Skipping...
```
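That error is raised by s3fs when a path does not resolve into a bucket plus object key, so a first step is to check how s3fs itself sees the failing path. A minimal diagnostic sketch, assuming the `bucket`, `folder`, and `s3_fs` names from the snippet above:

```python
# Diagnostic sketch (assumes bucket, folder, and s3_fs from the snippet above).
# Check how s3fs resolves the path that the reader failed on.
path = f"{bucket}/{folder}/file.docx"   # the failing path from the log

print(s3_fs.exists(path))               # does s3fs see the object at all?
print(s3_fs.isfile(path))               # does it resolve as a key, not a directory?
print(s3_fs.ls(f"{bucket}/{folder}"))   # which keys are listed under the prefix?
```

If these checks fail, the problem is in how s3fs resolves the path rather than in the reader itself.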
dosubot[bot] commented 2 months ago

Hey @MLai0519! I'm here to assist you with any bugs, questions, or contributions. Let's work together to resolve this issue.

To resolve the issue of failing to load a file using S3Reader and SimpleDirectoryReader with AWS S3, despite successfully testing the connection with boto3 and s3fs, you should ensure that the SimpleDirectoryReader is correctly configured with the necessary parameters. Here is a sample configuration:

```python
from llama_index.core.readers import SimpleDirectoryReader
from s3fs import S3FileSystem

# Create the filesystem using s3fs
s3_fs = S3FileSystem(
    anon=False,
    key=getawsaccess('uat')[0],
    secret=getawsaccess('uat')[1],
    cache_regions=False,
)

# Initialize the SimpleDirectoryReader with the required parameters
reader = SimpleDirectoryReader(
    input_dir='your-bucket-name/your-folder',
    fs=s3_fs,
    recursive=True,  # recursively search all subdirectories
    filename_as_id=True,
)

# Load the files as documents
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

# If you need to load files asynchronously:
# import nest_asyncio
# nest_asyncio.apply()
# docs = await reader.aload_data()
# print(f"Loaded {len(docs)} docs")
```

Ensure that:

  1. The input_dir parameter is correctly set to the name of your S3 bucket and folder.
  2. The fs parameter is set to an instance of S3FileSystem with the correct credentials.
  3. The recursive parameter is set to True if you want to search all subdirectories.
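As a quick sanity check of those points before constructing the reader, you can ask the filesystem directly whether it resolves the directory. A small sketch, assuming the `s3_fs` and `input_dir` values used above:

```python
# Sanity-check sketch (assumes s3_fs and the input_dir value used above).
input_dir = 'your-bucket-name/your-folder'

print(s3_fs.isdir(input_dir))  # expect True: the prefix resolves as a directory
print(s3_fs.ls(input_dir))     # expect the object keys the reader will load
```

If either check fails, the reader cannot succeed, and the credentials and the exact path string are the first things to revisit.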

Alternatively, you can use the BotoMinioReader class from the LlamaIndex library, which is specifically designed to handle S3 file and directory reading using boto3. Here is an example of how to use it:

```python
from llama_index.readers.minio.boto3_client.base import BotoMinioReader

# Initialize the BotoMinioReader with the required parameters
reader = BotoMinioReader(
    bucket='your-bucket-name',
    prefix='your-folder',  # optional: filter by prefix
    aws_access_id=getawsaccess('uat')[0],
    aws_access_secret=getawsaccess('uat')[1],
    s3_endpoint_url='your-s3-endpoint-url',
)

# Load the files as documents
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")
```

Ensure that:

  1. The bucket parameter is correctly set to the name of your S3 bucket.
  2. The prefix parameter is set if you want to filter by a specific prefix.
  3. The AWS credentials (aws_access_id and aws_access_secret) are correctly provided.
  4. The s3_endpoint_url is correctly set to your S3 endpoint URL.

This configuration should help resolve the failure to load the file [1][2][3].
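If both readers still fail, one workaround that sidesteps the filesystem layer entirely is to download the objects with boto3 (which the reporter has confirmed works) and read them locally. A minimal sketch, reusing the `bucket`, `folder`, and `getawsaccess` names from the snippets above:

```python
# Fallback sketch (assumes bucket, folder, and getawsaccess from above):
# fetch the objects with boto3, then read them from a local temp directory.
import tempfile
from pathlib import Path

import boto3
from llama_index.core import SimpleDirectoryReader

s3 = boto3.client(
    "s3",
    aws_access_key_id=getawsaccess("uat")[0],
    aws_secret_access_key=getawsaccess("uat")[1],
)

with tempfile.TemporaryDirectory() as tmpdir:
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=folder):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):  # skip zero-byte "folder" placeholder objects
                continue
            # note: this flattens subdirectories; keys with the same basename collide
            local_path = Path(tmpdir) / Path(key).name
            s3.download_file(bucket, key, str(local_path))

    docs = SimpleDirectoryReader(input_dir=tmpdir, filename_as_id=True).load_data()
    print(f"Loaded {len(docs)} docs")
```

This trades the fsspec integration for plain boto3 calls, which is also useful for isolating whether the failure lies in s3fs path handling or in the reader itself.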

To continue talking to Dosu, mention @dosu.