langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

PyPDFLoader to accept bytes objects as well #6265

Open navidre opened 1 year ago

navidre commented 1 year ago

Feature request

The PyPDFLoader class in document_loaders/pdf.py should accept a bytes object as well as a file path.

Motivation

When a PDF file is uploaded through a REST API call, there is no file_path to load from. One solution would be to accept the file's bytes as the input parameter instead.

Your contribution

I can submit a PR

rsokolewicz commented 10 months ago

In the related pull request, it's mentioned you can use GenericLoader instead. Does anyone know how that works exactly? I would like to load a BytesIO object or similar as well, but GenericLoader only has a from_filesystem class method.

ton77v commented 9 months ago

> In the related pull request, it's mentioned you can use GenericLoader instead. Does anyone know how that works exactly? I would like to load a BytesIO object or similar as well, but GenericLoader only has a from_filesystem class method.

+1. When downloading a document, it's much more convenient to load the bytes directly than to save them to a temp file first.

Siddhartha90 commented 9 months ago

+1. PDFs can also come back as bytes from an endpoint rather than being read from a file path, and GenericLoader doesn't seem to support this. I ended up using PyPDF2 to extract the text and putting it into the vector store for embedding.

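    from io import BytesIO

    import PyPDF2

    # content_in_bytes holds the raw PDF bytes, e.g. an HTTP response body
    pdf_stream = BytesIO(content_in_bytes)
    pdf_reader = PyPDF2.PdfReader(pdf_stream)
    # Extract the text page by page
    resume_text = ""
    for page in pdf_reader.pages:
        resume_text += page.extract_text()

FrancescoSaverioZuppichini commented 8 months ago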

+1, would be nice to have for other loaders as well

pgrach commented 8 months ago

+1

vivek-shetye-pdl commented 8 months ago

I want to load a PDF file uploaded through a Flask POST API. I receive the file as a FileStorage object and am not able to load it using PyPDFLoader. How would I handle this case?

    file = request.files['file']
    pages = PyPDFLoader(file).load_and_split()

This throws an error saying it needs a string path. I don't want to save the file locally, though I can get it from Google Cloud Storage. Are there any workarounds to get this working?

rsokolewicz commented 8 months ago

At the moment, it seems the only way is to save it either locally or on mounted cloud storage.
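
The save-it-locally workaround could look roughly like the sketch below. This is not LangChain code: `load_via_tempfile` and its `load_from_path` callback are hypothetical names introduced for illustration. It writes the bytes to a temporary file, hands the path to whatever path-based loader is passed in, and deletes the file afterwards.

```python
import os
import tempfile
from typing import Any, Callable


def load_via_tempfile(pdf_bytes: bytes, load_from_path: Callable[[str], Any]) -> Any:
    """Write bytes to a temporary .pdf file, load it by path, then clean up.

    `load_from_path` is a hypothetical callback, for example
    `lambda path: PyPDFLoader(path).load()`.
    """
    # delete=False so the file can be reopened by path on every platform.
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(pdf_bytes)
        tmp.close()  # flush to disk before the loader opens the path
        return load_from_path(tmp.name)
    finally:
        tmp.close()  # no-op if the file is already closed
        os.remove(tmp.name)
```

With a Flask FileStorage object, the bytes would come from `file.read()`, so a call might be `load_via_tempfile(file.read(), lambda p: PyPDFLoader(p).load_and_split())`. The callback has to finish reading before it returns, because the file is removed immediately afterwards.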

LeonLiur commented 7 months ago

+1. Current best solution is to use a package like PyPDF2 to read in text.

amn-max commented 5 months ago

If anyone wants a quick fix for now, here's how you can load PDFs from a BytesIO stream using PyMuPDF. The class below subclasses the built-in PyMuPDFLoader so that it accepts a BytesIO stream instead of a file path.

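import logging
from io import BytesIO
from typing import Any, List

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders.blob_loaders import Blob
from langchain_community.document_loaders.parsers.pdf import PyMuPDFParser
from langchain_core.documents import Document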

class BytesIOPyMuPDFLoader(PyMuPDFLoader):
    """Load `PDF` files using `PyMuPDF` from a BytesIO stream."""

    def __init__(
        self,
        pdf_stream: BytesIO,
        *,
        extract_images: bool = False,
        **kwargs: Any,
    ) -> None:
        """Initialize with a BytesIO stream."""
        try:
            import fitz  # noqa:F401
        except ImportError:
            raise ImportError(
                "`PyMuPDF` package not found, please install it with "
                "`pip install pymupdf`"
            )
        # We don't call the super().__init__ here because we don't have a file_path.
        self.pdf_stream = pdf_stream
        self.extract_images = extract_images
        self.text_kwargs = kwargs

    def load(self, **kwargs: Any) -> List[Document]:
        """Load file."""
        if kwargs:
            logging.warning(
                f"Received runtime arguments {kwargs}. Passing runtime args to `load`"
                f" is deprecated. Please pass arguments during initialization instead."
            )

        text_kwargs = {**self.text_kwargs, **kwargs}

        # Use 'stream' as a placeholder for file_path since we're working with a stream.
        blob = Blob.from_data(self.pdf_stream.getvalue(), path="stream")

        parser = PyMuPDFParser(
            text_kwargs=text_kwargs, extract_images=self.extract_images
        )

        return parser.parse(blob)

PyPDFLoader might be able to follow the same pattern.

regularE commented 5 months ago

+1

AyushModi123 commented 4 months ago

@navidre I'd like to get assigned to this issue. Thanks!

navidre commented 4 months ago

Hi @AyushModi123, I'm not sure I have access to assign issues, but feel free to create a new PR and tag this issue.

eyurtsev commented 3 months ago

The preferred way to do this is to use a BaseBlobParser together with Blob objects. See here: https://python.langchain.com/docs/modules/data_connection/document_loaders/custom#working-with-files

Parsers are currently only documented in the code base, but there are a number of PDF parsers available already!
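
A minimal sketch of the Blob-plus-parser approach for in-memory PDF bytes follows. The helper name `load_pdf_bytes` is hypothetical, and the import paths are assumptions based on the `langchain_community` package used earlier in this thread, so they may differ between versions. It assumes `langchain-community` and `pypdf` are installed.

```python
from typing import Any, List


def load_pdf_bytes(pdf_bytes: bytes) -> List[Any]:
    """Parse in-memory PDF bytes into LangChain Document objects."""
    # Imported lazily; requires `langchain-community` and `pypdf`.
    # These import paths are assumptions and may vary across versions.
    from langchain_community.document_loaders.blob_loaders import Blob
    from langchain_community.document_loaders.parsers.pdf import PyPDFParser

    # Wrap the raw bytes in a Blob; no file path is involved.
    blob = Blob.from_data(pdf_bytes, mime_type="application/pdf")
    # lazy_parse yields Documents as it reads the blob; collect them in a list.
    return list(PyPDFParser().lazy_parse(blob))
```

Because the Blob is built without a path, the documents' source metadata will be empty unless a `path` is passed to `Blob.from_data`, as the PyMuPDF example above does with a placeholder.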
