langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

UnstructuredPDFLoader: poppler and tesseract not found issue #26137

Open Yuxuan1998 opened 2 weeks ago

Yuxuan1998 commented 2 weeks ago

Example Code

from langchain_community.document_loaders import UnstructuredPDFLoader

path_to_pdf = "example.pdf"  # placeholder; replace with the path to your PDF
loader = UnstructuredPDFLoader(path_to_pdf)
loader.load()[0].page_content

Error Message and Stack Trace (if applicable)

  1. PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
  2. TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.

Description

I'm trying to use UnstructuredPDFLoader to load a PDF, but I get the errors listed above.

PDFInfoNotInstalledError

  1. I have installed poppler and added it to PATH.✅
  2. When I run which pdfinfo, it returns the correct path: /opt/homebrew/Cellar/poppler/24.04.0_1/bin/pdfinfo
  3. However, if I run poppler --version, I get zsh: command not found: poppler, and this happens on my other laptops as well❓
  4. The problem goes away if I manually change the default value of every poppler_path parameter from None to the path: poppler_path: Union[str, PurePath] = "/opt/homebrew/Cellar/poppler/24.04.0_1/bin/" (in ./.venv/lib/python3.11/site-packages/pdf2image/pdf2image.py)
  5. But then it raises TesseractNotFoundError

TesseractNotFoundError

  1. I have installed tesseract and added it to PATH.✅
  2. When I run which tesseract, it returns the correct path: /opt/homebrew/bin/tesseract
  3. When I run tesseract --version, it returns the correct version✅
  4. The problem goes away if I manually change the variable tesseract_cmd from 'tesseract' to the path: tesseract_cmd = '/opt/homebrew/Cellar/tesseract/5.4.1/bin/tesseract' (in ./.venv/lib/python3.11/site-packages/unstructured_pytesseract/pytesseract.py)
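
Since which finds both binaries in the shell but the loader still fails, one thing worth checking is whether the Python process itself sees the same PATH (an IDE or notebook kernel may start with a different environment). The following is only a diagnostic sketch using the standard library; the helper name find_tools is made up for illustration and is not part of langchain, pdf2image, or pytesseract.

```python
import os
import shutil


def find_tools(names=("pdfinfo", "tesseract")):
    """Return {tool name: resolved path or None} as seen by THIS Python process."""
    # shutil.which searches the PATH of the running interpreter,
    # which may differ from the PATH of an interactive shell.
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
    # Print the PATH the interpreter actually has, for comparison with `echo $PATH`.
    print("PATH =", os.environ.get("PATH", ""))
```

If either tool shows as not found when this is run from the same interpreter or kernel that runs the loader, the problem is likely the environment of that process rather than the installation itself.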

System Info

Package Information

langchain_core: 0.2.38
langchain: 0.2.16
langchain_community: 0.2.16
langsmith: 0.1.115
langchain_google_vertexai: 1.0.10
langchain_text_splitters: 0.2.4
langgraph: 0.2.18

platform mac

Python 3.11.3

sanjeev-kallepalli commented 3 days ago

Any update on this? I am also getting the same issue. I have both Poppler and Tesseract installed on my Windows PC.

sanjeev-kallepalli commented 3 days ago

@Yuxuan1998 try this. It resolved my issue. My tesseract was installed at the path shown below. The global variable tesseract_cmd in pytesseract was set to just tesseract. You can see it if you open the pytesseract.py file under unstructured_pytesseract in your .env folder.

import unstructured_pytesseract
unstructured_pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
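
An alternative that avoids editing files under site-packages, and that may cover poppler as well, is to prepend the binary directories to PATH inside the Python process before calling the loader. This is a minimal sketch, not a confirmed fix: prepend_to_path is a hypothetical helper, the two directories are the macOS locations quoted earlier in this thread, and it assumes the libraries launch pdfinfo and tesseract as subprocesses that inherit the current environment.

```python
import os


def prepend_to_path(directory: str) -> None:
    """Put `directory` at the front of this process's PATH unless it is already listed."""
    entries = [e for e in os.environ.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        os.environ["PATH"] = os.pathsep.join([directory] + entries)


# Directories taken from the `which` output reported in this thread (assumed to
# match your machine; adjust them to wherever the binaries actually live).
prepend_to_path("/opt/homebrew/Cellar/poppler/24.04.0_1/bin")  # contains pdfinfo
prepend_to_path("/opt/homebrew/bin")                           # contains tesseract
```

Run this before creating the UnstructuredPDFLoader. The change only affects the current process, so nothing is written to the shell configuration or to the installed packages.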