langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

OnlinePDFLoader crashes with import error on Google Colab #20700

Closed ishan-siddiqui closed 1 month ago

ishan-siddiqui commented 4 months ago

Example Code

Steps to Replicate:

Requirements.txt

%%writefile requirements.txt
replicate
langchain
langchain-community
sentence-transformers
pdf2image
pdfminer
pdfminer.six
unstructured
faiss-gpu
uvicorn
ctransformers
python-box
streamlit

Installing on colab

!pip install -r requirements.txt

Code I am trying to run

# Load the external data source
from langchain_community.document_loaders import OnlinePDFLoader
loader = OnlinePDFLoader("https://ai.meta.com/static-resource/responsible-use-guide/")
documents = loader.load()

Error Message and Stack Trace (if applicable)

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-90-759c82deb3bb> in <cell line: 4>()
      2 from langchain_community.document_loaders import OnlinePDFLoader
      3 loader = OnlinePDFLoader("https://ai.meta.com/static-resource/responsible-use-guide/")
----> 4 documents = loader.load()
      5 
      6 # Step 2: Get text splits from Document

4 frames
/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/pdf.py in load(self)
    157         """Load documents."""
    158         loader = UnstructuredPDFLoader(str(self.file_path))
--> 159         return loader.load()
    160 
    161 

/usr/local/lib/python3.10/dist-packages/langchain_core/document_loaders/base.py in load(self)
     27     def load(self) -> List[Document]:
     28         """Load data into Document objects."""
---> 29         return list(self.lazy_load())
     30 
     31     async def aload(self) -> List[Document]:

/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/unstructured.py in lazy_load(self)
     86     def lazy_load(self) -> Iterator[Document]:
     87         """Load file."""
---> 88         elements = self._get_elements()
     89         self._post_process_elements(elements)
     90         if self.mode == "elements":

/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/pdf.py in _get_elements(self)
     69 
     70     def _get_elements(self) -> List:
---> 71         from unstructured.partition.pdf import partition_pdf
     72 
     73         return partition_pdf(filename=self.file_path, **self.unstructured_kwargs)

/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py in <module>
     36 from pdfminer.utils import open_filename
     37 from PIL import Image as PILImage
---> 38 from pillow_heif import register_heif_opener
     39 
     40 from unstructured.chunking import add_chunking_strategy

ModuleNotFoundError: No module named 'pillow_heif'

Description

System Info

Langchain Version on Google Colab

langchain==0.1.16
langchain-community==0.0.34
langchain-core==0.1.45
langchain-text-splitters==0.0.1

Langchain Community Version on Google Colab

langchain-community==0.0.34

ishan-siddiqui commented 4 months ago

I am trying to follow the Meta Developers Llama 2 tutorial. Link for reference: https://youtu.be/Z5MFSlDrOdA?t=1539

salikadave commented 4 months ago

Hi @ishan-siddiqui, you will need to install the unstructured package before the import:

pip install "unstructured[all-docs]"

Source: unstructured_file.ipynb
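
To confirm whether the install resolved the problem before calling loader.load() again, a quick check along these lines may help. This is only a sketch using the standard library: the helper name missing_modules is made up for illustration, and the module list is simply taken from the imports visible in the traceback above, so it may not cover every dependency that unstructured needs.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level modules that appear in the traceback from this issue
# (an assumption: other optional dependencies may also be required).
required = ["unstructured", "pdfminer", "PIL", "pillow_heif"]

missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All listed modules were found")
```

If pillow_heif still shows up as missing after the install, restarting the Colab runtime and re-running the check is worth trying, since packages installed mid-session are not always picked up.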