Open utility-aagrawal opened 10 months ago
@jerryjliu @logan-markewich Can you please advise?
Feel free to make a PR to change how the loader works 🙂 I don't have any strong opinion on this
Feature Description
Use the original file names of files from S3 buckets instead of the random names generated by tempfile, to make S3Reader more useful.
Reason
I am using S3Reader to read files from an s3 bucket: https://llamahub.ai/l/s3
This loader downloads documents into a temp folder, but under arbitrary names. Consequently, if I save the StorageContext created using this loader, these arbitrary names end up in the metadata. When I create a RAG pipeline using this StorageContext, the sources for a response are these arbitrary file names and not the original file names. This is not really helpful if I want to present my users with the sources of my search results. I made some changes locally to make it work, but I wanted to understand the thought process behind using random file names for this reader/loader.
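To illustrate the idea, here is a minimal sketch of downloading into a temp folder while keeping the original names. It is not the actual S3Reader code: `download_keeping_names` and the `fetch` callable are hypothetical names made up for this example, and `fetch` stands in for whatever actually pulls the object from S3.

```python
import tempfile
from pathlib import Path, PurePosixPath
from typing import Callable, Dict, Iterable


def download_keeping_names(
    keys: Iterable[str],
    fetch: Callable[[str, Path], None],
    dest_dir: Path,
) -> Dict[str, Path]:
    """Download each key into dest_dir under a path that mirrors the key.

    `fetch(key, local_path)` is a hypothetical callable that writes the
    object to local_path. With boto3 it could wrap something like
    client.download_file(bucket, key, str(local_path)).
    """
    local_paths: Dict[str, Path] = {}
    for key in keys:
        # S3 keys use "/" as separator; drop a leading "/" and any ".."
        # so the file always lands inside dest_dir.
        parts = [p for p in PurePosixPath(key).parts if p not in ("/", "..")]
        if not parts:
            continue
        local_path = dest_dir.joinpath(*parts)
        local_path.parent.mkdir(parents=True, exist_ok=True)
        fetch(key, local_path)
        local_paths[key] = local_path
    return local_paths


if __name__ == "__main__":
    # Stand-in for a real S3 download, so the sketch runs without AWS access.
    def fake_fetch(key: str, local_path: Path) -> None:
        local_path.write_text("content of " + key)

    # The temp dir is still cleaned up afterwards, as with random names.
    with tempfile.TemporaryDirectory() as tmp:
        paths = download_keeping_names(
            ["reports/2023/q1.pdf", "notes.txt"], fake_fetch, Path(tmp)
        )
        for key, path in paths.items():
            print(key, "->", path.name)
```

With this layout, anything that records the local file name while reading from the temp folder would see `q1.pdf` or `notes.txt` rather than a random temp name, and the temp folder can still be removed once loading is done.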
Value of Feature
Users will be able to see sources with the correct file names when creating a RAG pipeline using AWS S3 buckets.
Thanks @utility-aagrawal. We just adopted a somewhat typical pattern here for downloading, i.e., spin up a tempdir and tempfile, get what you need, and let it clean up the temporary dir and files. But your situation necessitates a different solution, which is fine. Reviewing your PR now. Thanks for the contribution!