langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH? #16085

Closed: KaifAhmad1 closed this 8 months ago

KaifAhmad1 commented 10 months ago

Example Code

from langchain_community.document_loaders import DirectoryLoader

directory = '/content/drive/MyDrive/History_QA_dataset'

def load_docs(directory):
    # Load every file under `directory` with DirectoryLoader's default per-file loader.
    return DirectoryLoader(directory).load()

documents = load_docs(directory)

Description

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py](https://localhost:8080/#) in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout, first_page, last_page)
    580             env["LD_LIBRARY_PATH"] = poppler_path + ":" + env.get("LD_LIBRARY_PATH", "")
--> 581         proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
    582 

19 frames
FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'

During handling of the above exception, another exception occurred:

PDFInfoNotInstalledError                  Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py](https://localhost:8080/#) in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout, first_page, last_page)
    605 
    606     except OSError:
--> 607         raise PDFInfoNotInstalledError(
    608             "Unable to get page count. Is poppler installed and in PATH?"
    609         )

PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

System Info

Windows: 11
pip == 23.3.1
python == 3.10.10
langchain == 0.1.0
transformers == 4.36.2
sentence_transformers == 2.2.2
unstructured == 0.12.0

dosubot[bot] commented 10 months ago

🤖

Hey @KaifAhmad1! Good to see you again. Thanks for reaching out with your issue. 🚀

The error you're encountering is related to the pdf2image package, which is used to convert PDF files into images. The error message suggests that the poppler utility, which pdf2image relies on, is not installed or not in the system's PATH.

The DirectoryLoader class in LangChain, which you're using in your code, is responsible for loading documents from a specified directory. However, it doesn't directly use the pdf2image package. The usage of pdf2image might be in the specific loader class for PDF files, which is used by DirectoryLoader when it encounters a PDF file.
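For illustration, the delegation described above can be made explicit by choosing the per-file loader yourself. The sketch below is one possible workaround, not a fix confirmed in this thread. It assumes the directory contains PDFs with an extractable text layer and that the pypdf package is installed, and it swaps in PyPDFLoader so that the files should not go through pdf2image at all. The function name load_pdfs_with_pypdf is hypothetical.

```python
def load_pdfs_with_pypdf(directory):
    """Load every PDF under `directory` using PyPDFLoader instead of the default loader."""
    # Imported lazily so the function can be defined even where LangChain is absent.
    from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

    loader = DirectoryLoader(
        directory,
        glob="**/*.pdf",         # only pick up PDF files
        loader_cls=PyPDFLoader,  # per-file loader; text extraction via pypdf
    )
    return loader.load()
```

It would be called as `documents = load_pdfs_with_pypdf(directory)`. Because this route does no OCR, scanned PDFs without a text layer will likely yield little or no text.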

There are a couple of similar issues in the LangChain repository, but they don't seem to be directly related to your problem. The first one suggests trying a different version of the Pillow package, and the second one is about an error with the VectorstoreIndexCreator class.

To resolve your issue, you need to install poppler on your system and make sure it's in your system's PATH. The process to do this varies depending on your operating system. Since you're using Windows, you can download a precompiled binary of poppler for Windows from this link. After downloading and extracting the files, you need to add the bin directory to your system's PATH.

Here's how you can do it:

  1. Press the Windows key and search for "Environment Variables".
  2. Click on "Edit the system environment variables".
  3. In the System Properties window that opens, click on "Environment Variables".
  4. In the Environment Variables window, under "System variables", find and select "Path", then click on "Edit".
  5. In the Edit Environment Variable window, click on "New", then paste the path to the bin directory of the extracted poppler files.
  6. Click "OK" on all windows to close them.

After doing this, try running your code again. If you still encounter the same error, it might be because the system environment variables haven't been updated in your current command prompt or IDE. Try closing and reopening it, then run your code again.
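A quick way to confirm whether the change took effect is to ask the same Python process that runs the loader whether it can see pdfinfo, the Poppler executable named in the traceback. This is a minimal sketch using only the standard library; the helper name find_pdfinfo is hypothetical.

```python
import shutil


def find_pdfinfo():
    """Return the full path of the pdfinfo executable, or None if it is not on PATH."""
    return shutil.which("pdfinfo")


pdfinfo_path = find_pdfinfo()
if pdfinfo_path is None:
    print("pdfinfo is not visible on PATH from this Python process")
else:
    print(f"pdfinfo found at {pdfinfo_path}")
```

If this reports that pdfinfo is not visible, the loader will most likely keep failing in that process, whatever the system settings dialog shows.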

If you're still having trouble, please provide more details about your setup and the exact steps you're taking so I can better assist you.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

KaifAhmad1 commented 10 months ago

Hey @dosu-bot,

Thanks for the guidance! 🙌 The poppler download link seems to be a placeholder. Can you provide the correct one?

Appreciate your help, @KaifAhmad1

dosubot[bot] commented 10 months ago

🤖

Sure, I can help with that. The error message you're seeing is indicating that the poppler utility is not installed on your system. poppler is a PDF rendering library which is used by the pdf2image Python package to convert PDF files into images.

You can download and install poppler for Windows from this link: Poppler for Windows. After downloading, you need to add the bin directory from the downloaded package to your PATH environment variable.

Here are the steps to do this:

  1. Download the Poppler for Windows package.
  2. Extract the files from the downloaded package.
  3. Open the System Properties (Right click Computer in the start menu, or use the keyboard shortcut Win+Pause)
  4. Click Advanced system settings in the sidebar.
  5. Click Environment Variables...
  6. Select the PATH variable and click Edit...
  7. Add the path to the bin directory in the Poppler package to the end of the Variable value, separated from the existing paths by a semicolon.

After doing this, you should be able to use the pdf2image package without encountering the PDFInfoNotInstalledError.

If you still encounter issues after following these steps, please let me know!

For more information, you can refer to the Poppler for Windows website and the pdf2image package documentation.

I hope this helps! If you have any other questions, feel free to ask.

-Dosu

KaifAhmad1 commented 10 months ago

Hey @dosubot

The poppler download link seems to be a placeholder. Can you provide the correct one? This is the link Poppler for Windows which is not opening. Can you provide an alternative source?

@KaifAhmad1

dosubot[bot] commented 10 months ago

🤖

I'm sorry for the inconvenience you're experiencing. The error you're encountering is due to the absence of the poppler utility in your system's PATH. poppler is a PDF rendering library, which is used by the pdf2image package to convert PDF files into images.

The link you provided for Poppler for Windows seems to be down. As an alternative, you can download precompiled Poppler binaries for Windows from this GitHub repository. After downloading, you need to add the bin directory from the downloaded package to your system's PATH.

Here's how you can do it:

  1. Download the latest release from the GitHub repository.
  2. Extract the downloaded file.
  3. Add the full path of the bin directory from the extracted folder to your system's PATH.

To add the directory to your system's PATH, follow these steps:

  1. Press Win + X and choose System.
  2. Click on Advanced system settings.
  3. In the System Properties window that opens, click on the Environment Variables button.
  4. In the Environment Variables window, under System variables, scroll down and select Path, then click on Edit.
  5. In the Edit environment variable window, click on New and then paste the full path of the bin directory.
  6. Click OK on all windows to save the changes.

After doing this, you should be able to use poppler from any command prompt or terminal window. Please try running your code again after following these steps.
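If restarting the IDE or notebook is inconvenient, the PATH of the running Python process can also be extended before the loader is called. The sketch below assumes that approach is acceptable for your setup. The helper name prepend_to_path and the example directory are hypothetical, so substitute the bin directory you actually extracted.

```python
import os


def prepend_to_path(directory, env=None):
    """Put `directory` at the front of PATH in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    current = env.get("PATH", "")
    parts = current.split(os.pathsep) if current else []
    if directory not in parts:
        # Child processes started later (such as pdfinfo) inherit this value.
        env["PATH"] = os.pathsep.join([directory] + parts)
    return env["PATH"]


# Demonstration on a copy, so the real environment is left untouched here.
# The directory below is a placeholder, not a real install location.
demo_env = {"PATH": os.environ.get("PATH", "")}
prepend_to_path(r"C:\path\to\poppler\bin", demo_env)
print(demo_env["PATH"].split(os.pathsep)[0])
```

Calling `prepend_to_path(...)` without the second argument changes the current process only. It does not persist across sessions.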

If you're still encountering issues, please let me know!

KaifAhmad1 commented 10 months ago

Hey, @dosubot
I have downloaded Poppler 23.08.0 and added its bin path to my environment variables, but I am still getting the same error.

directory = '/content/drive/MyDrive/History_QA_dataset'
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader(directory)
documents = loader.load()
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py](https://localhost:8080/#) in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout, first_page, last_page)
    580             env["LD_LIBRARY_PATH"] = poppler_path + ":" + env.get("LD_LIBRARY_PATH", "")
--> 581         proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
    582 

18 frames
[/usr/lib/python3.10/subprocess.py](https://localhost:8080/#) in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, user, group, extra_groups, encoding, errors, text, umask, pipesize)
    970 
--> 971             self._execute_child(args, executable, preexec_fn, close_fds,
    972                                 pass_fds, cwd, env,

[/usr/lib/python3.10/subprocess.py](https://localhost:8080/#) in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, gid, gids, uid, umask, start_new_session)
   1862                         err_msg = os.strerror(errno_num)
-> 1863                     raise child_exception_type(errno_num, err_msg, err_filename)
   1864                 raise child_exception_type(err_msg)

FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'

During handling of the above exception, another exception occurred:

PDFInfoNotInstalledError                  Traceback (most recent call last)
[<ipython-input-19-78f443fb7723>](https://localhost:8080/#) in <cell line: 2>()
      1 loader = DirectoryLoader(directory)
----> 2 documents = loader.load()

[/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/directory.py](https://localhost:8080/#) in load(self)
    155         else:
    156             for i in items:
--> 157                 self.load_file(i, p, docs, pbar)
    158 
    159         if pbar:

[/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/directory.py](https://localhost:8080/#) in load_file(self, item, path, docs, pbar)
    104                         logger.warning(f"Error loading file {str(item)}: {e}")
    105                     else:
--> 106                         raise e
    107                 finally:
    108                     if pbar:

[/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/directory.py](https://localhost:8080/#) in load_file(self, item, path, docs, pbar)
     98                 try:
     99                     logger.debug(f"Processing file: {str(item)}")
--> 100                     sub_docs = self.loader_cls(str(item), **self.loader_kwargs).load()
    101                     docs.extend(sub_docs)
    102                 except Exception as e:

[/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/unstructured.py](https://localhost:8080/#) in load(self)
     85     def load(self) -> List[Document]:
     86         """Load file."""
---> 87         elements = self._get_elements()
     88         self._post_process_elements(elements)
     89         if self.mode == "elements":

[/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/unstructured.py](https://localhost:8080/#) in _get_elements(self)
    171         from unstructured.partition.auto import partition
    172 
--> 173         return partition(filename=self.file_path, **self.unstructured_kwargs)
    174 
    175     def _get_metadata(self) -> dict:

[/usr/local/lib/python3.10/dist-packages/unstructured/partition/auto.py](https://localhost:8080/#) in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, xml_keep_tags, data_source_metadata, metadata_filename, request_timeout, hi_res_model_name, model_name, **kwargs)
    407     elif filetype == FileType.PDF:
    408         _partition_pdf = _get_partition_with_extras("pdf")
--> 409         elements = _partition_pdf(
    410             filename=filename,  # type: ignore
    411             file=file,  # type: ignore

[/usr/local/lib/python3.10/dist-packages/unstructured/documents/elements.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
    516         @functools.wraps(func)
    517         def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 518             elements = func(*args, **kwargs)
    519             sig = inspect.signature(func)
    520             params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

[/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
    589         @functools.wraps(func)
    590         def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 591             elements = func(*args, **kwargs)
    592             sig = inspect.signature(func)
    593             params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

[/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
    544     @functools.wraps(func)
    545     def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 546         elements = func(*args, **kwargs)
    547         sig = inspect.signature(func)
    548         params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

[/usr/local/lib/python3.10/dist-packages/unstructured/chunking/__init__.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
     50         @functools.wraps(func)
     51         def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
---> 52             elements = func(*args, **kwargs)
     53             sig = inspect.signature(func)
     54             params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)

[/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py](https://localhost:8080/#) in partition_pdf(filename, file, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, include_metadata, metadata_filename, metadata_last_modified, chunking_strategy, links, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, **kwargs)
    204     languages = check_languages(languages, ocr_languages)
    205 
--> 206     return partition_pdf_or_image(
    207         filename=filename,
    208         file=file,

[/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py](https://localhost:8080/#) in partition_pdf_or_image(filename, file, is_image, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, metadata_last_modified, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, **kwargs)
    559         # NOTE(robinson): Catches file conversion warnings when running with PDFs
    560         with warnings.catch_warnings():
--> 561             elements = _partition_pdf_or_image_with_ocr(
    562                 filename=filename,
    563                 file=file,

[/usr/local/lib/python3.10/dist-packages/unstructured/utils.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
    212                     ),
    213                 )
--> 214             return func(*args, **kwargs)
    215 
    216         return wrapper

[/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py](https://localhost:8080/#) in _partition_pdf_or_image_with_ocr(filename, file, include_page_breaks, languages, is_image, metadata_last_modified, **kwargs)
    899     else:
    900         page_number = 0
--> 901         for image in convert_pdf_to_images(filename, file):
    902             page_number += 1
    903             page_elements = _partition_pdf_or_image_with_ocr_from_image(

[/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py](https://localhost:8080/#) in convert_pdf_to_images(filename, file, chunk_size)
    846     else:
    847         f_bytes = None
--> 848         info = pdf2image.pdfinfo_from_path(filename)
    849 
    850     total_pages = info["Pages"]

[/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py](https://localhost:8080/#) in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout, first_page, last_page)
    605 
    606     except OSError:
--> 607         raise PDFInfoNotInstalledError(
    608             "Unable to get page count. Is poppler installed and in PATH?"
    609         )

PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

dosubot[bot] commented 10 months ago

🤖

Based on the information you've provided and the similar issue I found in the LangChain repository, it seems that the error you're encountering is due to Poppler not being correctly installed or not being found in your system's PATH. Poppler is a PDF rendering library. DirectoryLoader does not call it directly; as your traceback shows, its default loader hands PDF files to unstructured, which calls pdf2image, which in turn runs Poppler's pdfinfo executable.

The download link you used for Poppler might not have provided the correct binaries for your system. I recommend downloading the Poppler binaries from this link: Poppler Windows Releases. After downloading, you need to extract the files and add the bin directory to your system's PATH.

Here's how you can add the directory to your PATH:

  1. Press Win + X and choose System.
  2. Click on Advanced system settings.
  3. In the System Properties window that opens, click the Environment Variables button.
  4. In the Environment Variables window, under System variables, find and select the Path variable, then click Edit.
  5. In the Edit Environment Variable window, click New and add the path to the bin directory of the extracted Poppler files.

Please note that you need to restart your command prompt or IDE for the changes to take effect. If you're still encountering issues after this, please let me know.
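One thing worth checking before repeating the Windows steps: the traceback paths in this thread (/usr/local/lib/python3.10/dist-packages and /content/drive/MyDrive) look like a Linux host, such as a hosted notebook, in which case editing the PATH of a local Windows machine would not affect the interpreter that raises the error. The sketch below reports which platform the interpreter runs on and gives a matching hint. The function name poppler_hint and the wording of the hints are assumptions for illustration.

```python
import platform
import shutil


def poppler_hint():
    """Describe where this interpreter runs and what to try if pdfinfo is missing."""
    system = platform.system()
    if shutil.which("pdfinfo"):
        return f"{system}: pdfinfo is already visible to this interpreter"
    if system == "Linux":
        # poppler-utils is the Debian/Ubuntu package that ships pdfinfo.
        return "Linux: install the distribution's Poppler package (poppler-utils on Debian/Ubuntu)"
    if system == "Darwin":
        return "macOS: install Poppler, for example with Homebrew"
    if system == "Windows":
        return "Windows: add the Poppler bin directory to PATH, then restart the interpreter"
    return f"{system}: install Poppler and make sure pdfinfo is on PATH"


print(poppler_hint())
```

Run it in the same notebook or script that calls DirectoryLoader, since that is the environment pdf2image searches.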

JayKayNJIT commented 5 months ago

Getting same error here. What was the solution? Thanks.