run-llama / llama-hub

A library of data loaders for LLMs made by the community -- to be used with LlamaIndex and/or LangChain
https://llamahub.ai/
MIT License
3.44k stars 731 forks

Handle bytestrings in PDF reader #834

Closed PhilippeMoussalli closed 8 months ago

PhilippeMoussalli commented 8 months ago

Description

This PR fixes the PDFReader so that it can read files passed as bytestrings. The loader already supported this input type, but a small bug made it expect a file name, which is not available when a PDF bytestring is passed. The file name can still be added manually through the extra_info argument.
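The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual PDFReader code: the helper name `build_page_metadata`, its parameters, and the metadata keys `page_label` and `file_name` are assumptions made for the example. It only shows the idea of reading a file name when a path is given and skipping it for a byte stream.

```python
from io import BytesIO
from pathlib import Path
from typing import Any, Dict, Optional, Union


def build_page_metadata(
    file: Union[str, Path, BytesIO],
    page_label: str,
    extra_info: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build per-page metadata for a path or a byte stream."""
    metadata: Dict[str, Any] = {"page_label": page_label}
    # Only a real path carries a file name; a BytesIO object has none.
    if isinstance(file, (str, Path)):
        metadata["file_name"] = Path(file).name
    # Caller-supplied values are merged last, so a file name can be added manually.
    if extra_info is not None:
        metadata.update(extra_info)
    return metadata


# A path yields a file name; a byte stream does not unless extra_info provides one.
from_path = build_page_metadata("docs/report.pdf", "1")
from_bytes = build_page_metadata(BytesIO(b"%PDF-1.4"), "1")
named_bytes = build_page_metadata(
    BytesIO(b"%PDF-1.4"), "1", {"file_name": "report.pdf"}
)
```

Merging extra_info last means a manually supplied file name also overrides the one derived from a path, which may or may not match what the real loader does.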

PDF bytestrings are useful when the user reads files through a custom filesystem interface such as fsspec rather than from a local path.

Example:

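from io import BytesIO

import fsspec

# Create the GCS file system (requires the gcsfs package)
gcs = fsspec.filesystem('gcs')

# Open and read the PDF file from GCS; bucket_name and file_path are placeholders
with gcs.open(f'{bucket_name}/{file_path}', 'rb') as file:
    pdf_content = BytesIO(file.read())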

Fixes # (issue)

Type of Change

Please delete options that are not relevant.

How Has This Been Tested?

I ran the class locally with different file inputs (a path to a PDF and PDF bytestrings).

Suggested Checklist: