Open navidre opened 1 year ago
In the related pull request, it's mentioned you can use GenericLoader instead. Does anyone know how that works exactly? I would like to load a BytesIO object or similar as well, but GenericLoader only has a from_filesystem class method.
+1 When downloading the document, it's much more convenient to load the bytes rather than saving to a temp file etc
+1 PDFs can also come back as bytes from some endpoint and not necessarily be read from a file path. It doesn't seem that GenericLoader supports this. I ended up using PyPDF2 to extract the text and shoving it into the vector store for embedding.
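    from io import BytesIO

    import PyPDF2

    # content_in_bytes holds the raw PDF bytes returned by the endpoint
    pdf_stream = BytesIO(content_in_bytes)
    pdf_reader = PyPDF2.PdfReader(pdf_stream)
    resume_text = ""
    # Extract text page by page
    for page in pdf_reader.pages:
        resume_text += page.extract_text()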
+1, would be nice to have for other loaders as well
+1
I want to load a PDF file uploaded through a Flask POST API. I receive the file as a FileStorage object and am not able to load it using PyPDFLoader. How would I handle this case?
    file = request.files['file']
    pages = PyPDFLoader(file).load_and_split()
This throws an error saying it needs to be a string path. I don't want to save the file locally, though I can get it from Google Cloud Storage. Are there any workarounds to get this working?
At the moment, it seems the only way is to save it either locally or on mounted cloud storage.
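For anyone who goes the save-it-locally route in the meantime, here is a minimal sketch of that workaround using only the standard library. The helper name `load_via_temp_file` and its `load_from_path` parameter are hypothetical, made up for illustration; they are not part of LangChain. The idea is to write the bytes to a temporary file, hand the path to any path-based loader, and delete the file afterwards.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def load_via_temp_file(
    data: bytes,
    load_from_path: Callable[[str], T],
    suffix: str = ".pdf",
) -> T:
    """Write `data` to a temp file, run a path-based loader on it, then clean up.

    `load_from_path` is a hypothetical callback: any function that takes a
    file path and returns the fully loaded result.
    """
    # delete=False so the file can be reopened by name after it is closed.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        tmp_path = tmp.name
    try:
        return load_from_path(tmp_path)
    finally:
        # Remove the temp file even if the loader raises.
        os.remove(tmp_path)
```

With the Flask upload above, this could be called as `load_via_temp_file(file.read(), lambda p: PyPDFLoader(p).load_and_split())`, assuming the loader reads everything before returning. It still touches disk briefly, so it is a stopgap rather than real bytes support.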
+1. The current best solution is to use a package like PyPDF2 to read in the text.
If anyone is wondering or wants a quick fix for now, here's how you can load PDFs from a BytesIO stream using PyMuPDF. The class below subclasses the built-in PyMuPDFLoader and overrides it to handle PDFs from BytesIO.
    import logging
    from io import BytesIO
    from typing import Any, List

    from langchain_community.document_loaders import PyMuPDFLoader
    from langchain_community.document_loaders.blob_loaders import Blob
    from langchain_community.document_loaders.parsers.pdf import (
        PyMuPDFParser,
    )
    from langchain_core.documents import Document


    class BytesIOPyMuPDFLoader(PyMuPDFLoader):
        """Load `PDF` files using `PyMuPDF` from a BytesIO stream."""

        def __init__(
            self,
            pdf_stream: BytesIO,
            *,
            extract_images: bool = False,
            **kwargs: Any,
        ) -> None:
            """Initialize with a BytesIO stream."""
            try:
                import fitz  # noqa:F401
            except ImportError:
                raise ImportError(
                    "`PyMuPDF` package not found, please install it with "
                    "`pip install pymupdf`"
                )
            # We don't call the super().__init__ here because we don't have a file_path.
            self.pdf_stream = pdf_stream
            self.extract_images = extract_images
            self.text_kwargs = kwargs

        def load(self, **kwargs: Any) -> List[Document]:
            """Load file."""
            if kwargs:
                logging.warning(
                    f"Received runtime arguments {kwargs}. Passing runtime args to `load`"
                    " is deprecated. Please pass arguments during initialization instead."
                )
            # Runtime kwargs take precedence over those given at initialization.
            text_kwargs = {**self.text_kwargs, **kwargs}
            # Use 'stream' as a placeholder for file_path since we're working with a stream.
            blob = Blob.from_data(self.pdf_stream.getvalue(), path="stream")
            parser = PyMuPDFParser(
                text_kwargs=text_kwargs, extract_images=self.extract_images
            )
            return parser.parse(blob)
PyPDFLoader might also follow the same pattern
+1
@navidre I'd like to get assigned to this issue. Thanks!
Hi @AyushModi123, Not sure if I have access to assign, but feel free to create a new PR and tag this issue
The preferred way to achieve this task is using BaseBlobParsers and Blob objects. See here: https://python.langchain.com/docs/modules/data_connection/document_loaders/custom#working-with-files
Parsers are currently only documented in the code base, but there are a number of PDF parsers available already!
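As a rough illustration of the Blob-plus-parser approach for the PyPDF case raised in this issue, here is a minimal sketch. It assumes `langchain_community` and `pypdf` are installed and that `Blob` and `PyPDFParser` are importable from the module paths shown, which may differ between versions. The function name `pdf_bytes_to_documents` and the `"stream"` label are hypothetical.

```python
from typing import Any, List


def pdf_bytes_to_documents(data: bytes, source: str = "stream") -> List[Any]:
    """Parse in-memory PDF bytes into LangChain documents without a file path."""
    # Imported lazily so this module still imports where LangChain is absent.
    # Assumption: these import paths match the installed langchain_community.
    from langchain_community.document_loaders.blob_loaders import Blob
    from langchain_community.document_loaders.parsers.pdf import PyPDFParser

    # `path` is only a placeholder label here; no file is read from disk.
    blob = Blob.from_data(data, path=source)
    return PyPDFParser().parse(blob)
```

For the REST upload case, `data` would be the raw bytes of the uploaded file, for example `file.read()` on the Flask FileStorage object mentioned earlier in the thread.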
Feature request
Allow the PyPDFLoader class in document_loaders/pdf.py to accept a bytes object as well.
Motivation
When a PDF file is uploaded through a REST API call, there is no specific file_path to load from. A solution would be to accept the file's bytes as an input parameter instead.
Your contribution
I can submit a PR