PyPDFLoader to accept bytes objects as well

navidre commented 1 year ago

Feature request

The PyPDFLoader class in document_loaders/pdf.py should accept a bytes object as well.

Motivation

When a PDF file is uploaded through a REST API call, there is no file_path to load from. One solution is to accept the file's bytes as the input parameter instead.

Your contribution

I can submit a PR

rsokolewicz commented 10 months ago

In the related pull request, it's mentioned you can use GenericLoader instead. Does anyone know how that works exactly? I would like to load a BytesIO object or similar as well, but GenericLoader only has a from_filesystem class method.

ton77v commented 9 months ago

> In the related pull request, it's mentioned you can use GenericLoader instead. Does anyone know how that works exactly? I would like to load a BytesIO object or similar as well, but GenericLoader only has a from_filesystem class method.

+1. When downloading the document, it's much more convenient to load the bytes than to save them to a temp file first.

Siddhartha90 commented 9 months ago

+1 PDFs can also come back in bytes from some endpoint and not necessarily be read from a file path. Doesn't seem that GenericLoader supports this. I ended up using PyPDF2 to extract the text and shoving it into the vector store for embedding it.

    from io import BytesIO

    import PyPDF2

    pdf_stream = BytesIO(content_in_bytes)  # content_in_bytes: the raw PDF bytes
    pdf_reader = PyPDF2.PdfReader(pdf_stream)
    resume_text = ""
    # Extract text page by page
    for page in pdf_reader.pages:
        resume_text += page.extract_text()

FrancescoSaverioZuppichini commented 8 months ago

+1, would be nice to have for other loaders as well

pgrach commented 8 months ago

+1

vivek-shetye-pdl commented 8 months ago

I want to load a PDF file uploaded through a Flask POST API. I receive the file as a FileStorage object and am not able to load it with PyPDFLoader. How would I handle this case?

    file = request.files['file']
    pages = PyPDFLoader(file).load_and_split()

This throws an error saying it needs to be a string path. I don't want to save the file locally, though I can get it from Google Cloud Storage. Are there any workarounds to get this working?
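Whatever workaround ends up being used, the first step is usually getting the raw bytes out of the upload. Here is a minimal sketch of that step, assuming the upload is either bytes already or an object with a file-like read() method (which a Flask FileStorage should provide). The helper name read_upload_bytes is hypothetical and not part of Flask or LangChain.

```python
from io import BytesIO
from typing import BinaryIO, Union


def read_upload_bytes(upload: Union[bytes, bytearray, BinaryIO]) -> bytes:
    """Return the raw bytes of an upload given as bytes or a readable binary object."""
    if isinstance(upload, (bytes, bytearray)):
        return bytes(upload)
    # Rewind first when the object reports itself as seekable,
    # in case part of it was already consumed.
    if getattr(upload, "seekable", lambda: False)():
        upload.seek(0)
    return upload.read()


# Stand-in for an uploaded file; a real upload object would be passed instead.
pdf_stream = BytesIO(read_upload_bytes(BytesIO(b"%PDF-1.4 example")))
```

The resulting bytes or BytesIO can then be handed to any parser that accepts a stream.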

rsokolewicz commented 8 months ago

At the moment it seems the only way is to save the file first, either locally or on mounted cloud storage.
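For the save-it-first route, a temporary file keeps the disk footprint short-lived. This is a rough sketch using only the standard library; the helper name temporary_pdf_path is hypothetical, and it assumes the loader only needs a readable path for the duration of the with block.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_pdf_path(data: bytes) -> Iterator[str]:
    """Write bytes to a temporary .pdf file, yield its path, then delete it."""
    # delete=False so the file can be closed and reopened by path on any platform.
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(data)
        tmp.close()  # flush to disk before handing the path to a loader
        yield tmp.name
    finally:
        tmp.close()  # no-op if already closed
        if os.path.exists(tmp.name):
            os.remove(tmp.name)
```

Inside the block the path can be passed to a path-based loader, for example `with temporary_pdf_path(pdf_bytes) as path: pages = PyPDFLoader(path).load_and_split()`. The loader has to finish reading before the block exits, because the file is removed at that point.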

LeonLiur commented 7 months ago

+1. Current best solution is to use a package like PyPDF2 to read in text.

amn-max commented 5 months ago

If anyone is wondering or wants a quick fix for now, here's how you can load PDFs from a BytesIO stream using PyMuPDF. It subclasses the built-in PyMuPDFLoader so that it takes a BytesIO stream instead of a file path.

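import logging
from io import BytesIO
from typing import Any, List

from langchain_core.documents import Document
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders.blob_loaders import Blob
from langchain_community.document_loaders.parsers.pdf import (
    PyMuPDFParser,
)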

class BytesIOPyMuPDFLoader(PyMuPDFLoader):
    """Load `PDF` files using `PyMuPDF` from a BytesIO stream."""

    def __init__(
        self,
        pdf_stream: BytesIO,
        *,
        extract_images: bool = False,
        **kwargs: Any,
    ) -> None:
        """Initialize with a BytesIO stream."""
        try:
            import fitz  # noqa:F401
        except ImportError:
            raise ImportError(
                "`PyMuPDF` package not found, please install it with "
                "`pip install pymupdf`"
            )
        # We don't call the super().__init__ here because we don't have a file_path.
        self.pdf_stream = pdf_stream
        self.extract_images = extract_images
        self.text_kwargs = kwargs

    def load(self, **kwargs: Any) -> List[Document]:
        """Load file."""
        if kwargs:
            logging.warning(
                f"Received runtime arguments {kwargs}. Passing runtime args to `load`"
                f" is deprecated. Please pass arguments during initialization instead."
            )

        text_kwargs = {**self.text_kwargs, **kwargs}

        # Use 'stream' as a placeholder for file_path since we're working with a stream.
        blob = Blob.from_data(self.pdf_stream.getvalue(), path="stream")

        parser = PyMuPDFParser(
            text_kwargs=text_kwargs, extract_images=self.extract_images
        )

        return parser.parse(blob)

PyPDFLoader could probably follow the same pattern.