run-llama / llama-hub

A library of data loaders for LLMs made by the community -- to be used with LlamaIndex and/or LangChain
https://llamahub.ai/
MIT License

[Feature Request]: Original files name in the StorageContext metadata using S3Reader? #611

Open utility-aagrawal opened 10 months ago

utility-aagrawal commented 10 months ago

Feature Description

Use the original file names of files from S3 buckets instead of the random names generated by tempfile, making S3Reader more useful.

Reason

I am using S3Reader to read files from an s3 bucket: https://llamahub.ai/l/s3

This loader downloads documents into a temp folder but under arbitrary names. Consequently, if I save the StorageContext created with this loader, those arbitrary names end up in the metadata. When I build a RAG pipeline on that StorageContext, the sources cited for a response are these arbitrary file names rather than the original ones, which is not helpful when I want to show my users the sources behind search results. I made some changes locally to work around this, but I wanted to understand the thinking behind using random file names in this reader/loader.
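For reference, here is a minimal workaround sketch of the kind of change I mean: download the objects myself with boto3 under their original base names and attach the name as metadata via SimpleDirectoryReader's file_metadata hook. The bucket and prefix names are placeholders, and this is not how S3Reader itself is implemented.

```python
# Hypothetical workaround sketch -- not the S3Reader implementation.
# Assumes boto3 credentials are already configured; bucket/prefix are placeholders.
import os
import tempfile

import boto3
from llama_index import SimpleDirectoryReader

bucket = "my-bucket"   # placeholder
prefix = "docs/"       # placeholder

s3 = boto3.client("s3")

with tempfile.TemporaryDirectory() as tmp_dir:
    # Download each object under its original base name instead of a random
    # tempfile name, so the name survives into the Document metadata.
    # (Note: base names that collide across "folders" would overwrite each other.)
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):
            continue  # skip "folder" placeholder keys
        local_path = os.path.join(tmp_dir, os.path.basename(key))
        s3.download_file(bucket, key, local_path)

    # file_metadata records the original file name (the full S3 key could be
    # added here too) on every Document that gets indexed.
    reader = SimpleDirectoryReader(
        input_dir=tmp_dir,
        file_metadata=lambda path: {"file_name": os.path.basename(path)},
    )
    documents = reader.load_data()
```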

Value of Feature

Users will be able to see sources with the correct file names when building a RAG pipeline over AWS S3 buckets.

utility-aagrawal commented 10 months ago

@jerryjliu @logan-markewich Can you please advise?

logan-markewich commented 10 months ago

Feel free to make a PR to change how the loader works 🙂 I don't have any strong opinion on this

nerdai commented 9 months ago

Thanks @utility-aagrawal, we just adopted a somewhat typical pattern here for downloading, i.e., spin up a temp dir and temp file, get what you need, and the temporary directory and files are cleaned up afterward. But your situation necessitates a different solution, which is fine. Reviewing your PR now. Thanks for the contribution!
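For readers unfamiliar with that pattern, a minimal illustration (with hypothetical bucket/key names) of why the downloaded files end up with arbitrary names:

```python
# Rough sketch of the "typical" download pattern described above (simplified,
# hypothetical): a NamedTemporaryFile gets a random name, and that random name
# is what ends up in the Document metadata unless it is overridden.
import tempfile

import boto3

s3 = boto3.client("s3")

with tempfile.TemporaryDirectory() as tmp_dir:
    # The suffix keeps the extension so file-type detection still works, but
    # the rest of the name is random (e.g. /tmp/xyz/tmpab12cd34.pdf).
    with tempfile.NamedTemporaryFile(dir=tmp_dir, suffix=".pdf", delete=False) as tmp:
        s3.download_fileobj("my-bucket", "reports/q3.pdf", tmp)  # placeholder names
        print(tmp.name)  # arbitrary local path, not the original S3 key
# Both the file and the directory are removed when the with-blocks exit.
```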