Closed KaifAhmad1 closed 8 months ago
🤖
Hey @KaifAhmad1! Good to see you again. Thanks for reaching out with your issue. 🚀
The error you're encountering is related to the `pdf2image` package, which is used to convert PDF files into images. The error message suggests that the `poppler` utility, which `pdf2image` relies on, is not installed or not in the system's PATH.
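One quick way to confirm this diagnosis is to ask Python whether it can locate the Poppler executables. The sketch below uses only the standard library. `pdfinfo` is the executable that `pdf2image` needs for page counts; checking for `pdftoppm` as well is an assumption about what your workflow calls, and the helper name `find_poppler_tools` is purely illustrative.

```python
import shutil


def find_poppler_tools(tools=("pdfinfo", "pdftoppm")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, location in find_poppler_tools().items():
        print(f"{tool}: {location or 'not found on PATH'}")
```

Run this in the same interpreter or notebook that raises the error, since a different terminal may see a different PATH.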
The `DirectoryLoader` class in LangChain, which you're using in your code, is responsible for loading documents from a specified directory. However, it doesn't directly use the `pdf2image` package. The usage of `pdf2image` might be in the specific loader class for PDF files, which is used by `DirectoryLoader` when it encounters a PDF file.
There are a couple of similar issues in the LangChain repository, but they don't seem to be directly related to your problem. The first one suggests trying a different version of the Pillow package, and the second one is about an error with the `VectorstoreIndexCreator` class.
To resolve your issue, you need to install `poppler` on your system and make sure it's in your system's PATH. The process to do this varies depending on your operating system. Since you're using Windows, you can download a precompiled binary of `poppler` for Windows from this link. After downloading and extracting the files, you need to add the `bin` directory to your system's PATH.
Here's how you can do it:
Add the `bin` directory of the extracted `poppler` files to your system's PATH.
After doing this, try running your code again. If you still encounter the same error, it might be because the system environment variables haven't been updated in your current command prompt or IDE. Try closing and reopening it, then run your code again.
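To tell whether the running process has picked up the updated PATH, you can inspect the value Python actually sees. This is a minimal standard-library sketch; the helper names `path_entries` and `is_on_path` are hypothetical, and the directory you pass in is whatever location you extracted Poppler to.

```python
import os


def path_entries(path_value=None):
    """Split a PATH string (default: this process's PATH) into non-empty entries."""
    raw = os.environ.get("PATH", "") if path_value is None else path_value
    return [entry for entry in raw.split(os.pathsep) if entry]


def is_on_path(directory, path_value=None):
    """Return True if `directory` appears in PATH, ignoring trailing separators and case rules."""
    target = os.path.normcase(os.path.normpath(directory))
    return any(
        os.path.normcase(os.path.normpath(entry)) == target
        for entry in path_entries(path_value)
    )
```

If `is_on_path` returns `False` for your Poppler `bin` directory inside the IDE or notebook but the system settings show it, the process was probably started before the change and needs a restart.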
If you're still having trouble, please provide more details about your setup and the exact steps you're taking so I can better assist you.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
Hey @dosu-bot,
Thanks for the guidance! 🙌 The poppler download link seems to be a placeholder. Can you provide the correct one?
Appreciate your help, @KaifAhmad1
🤖
Sure, I can help with that. The error message you're seeing indicates that the `poppler` utility is not installed on your system. `poppler` is a PDF rendering library that the `pdf2image` Python package uses to convert PDF files into images.
You can download and install `poppler` for Windows from this link: Poppler for Windows. After downloading, you need to add the `bin` directory from the downloaded package to your PATH environment variable.
Here are the steps to do this:
Append the path of the `bin` directory in the Poppler package to the end of the Variable value, separated from the existing paths by a semicolon.
After doing this, you should be able to use the `pdf2image` package without encountering the `PDFInfoNotInstalledError`.
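If editing the system-wide variable is inconvenient, or the change has not reached your running session yet, the same effect can be approximated for the current Python process only. The sketch below is an assumption-laden workaround rather than the documented fix: `prepend_to_path` is a hypothetical helper, the directory shown is a placeholder, and child processes such as `pdfinfo` should inherit the modified PATH only for as long as this interpreter runs.

```python
import os


def prepend_to_path(directory):
    """Put `directory` at the front of this process's PATH, without duplicates."""
    entries = [e for e in os.environ.get("PATH", "").split(os.pathsep) if e]
    entries = [directory] + [e for e in entries if e != directory]
    os.environ["PATH"] = os.pathsep.join(entries)
    return os.environ["PATH"]


if __name__ == "__main__":
    # Hypothetical location: replace with wherever you extracted Poppler.
    poppler_bin = os.path.join(os.path.expanduser("~"), "poppler", "bin")
    prepend_to_path(poppler_bin)
    print(os.environ["PATH"].split(os.pathsep)[0])
```

Call it before loading any PDFs. The change is not persistent, so it does not replace adding the directory to the system PATH.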
If you still encounter issues after following these steps, please let me know!
For more information, you can refer to the Poppler for Windows website and the pdf2image package documentation.
I hope this helps! If you have any other questions, feel free to ask.
-Dosu
Hey @dosubot
The poppler download link seems to be a placeholder. Can you provide the correct one? This is the link Poppler for Windows which is not opening. Can you provide an alternative source?
@KaifAhmad1
🤖
I'm sorry for the inconvenience you're experiencing. The error you're encountering is due to the absence of the `poppler` utility in your system's PATH. `poppler` is a PDF rendering library, which is used by the `pdf2image` package to convert PDF files into images.
The link you provided for Poppler for Windows seems to be down. As an alternative, you can download precompiled Poppler binaries for Windows from this GitHub repository. After downloading, you need to add the `bin` directory from the downloaded package to your system's PATH.
Here's how you can do it:
Add the `bin` directory from the extracted folder to your system's PATH.
To add the directory to your system's PATH, follow these steps:
1. Press `Win + X` and choose `System`.
2. Click `Advanced system settings`.
3. In the `System Properties` window that opens, click on the `Environment Variables` button.
4. In the `Environment Variables` window, under `System variables`, scroll down and select `Path`, then click on `Edit`.
5. In the `Edit environment variable` window, click on `New` and then paste the full path of the `bin` directory.
6. Click `OK` on all windows to save the changes.
After doing this, you should be able to use `poppler` from any command prompt or terminal window. Please try running your code again after following these steps.
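A simple way to verify the result from Python is to launch `pdfinfo` and see whether it starts at all. This sketch assumes the tool accepts `-v` to print its version, which Poppler's command-line utilities generally do; `poppler_version` is a hypothetical helper name.

```python
import subprocess


def poppler_version(executable="pdfinfo"):
    """Return the text printed by `<executable> -v`, or None if it cannot be launched."""
    try:
        result = subprocess.run(
            [executable, "-v"], capture_output=True, text=True, timeout=10
        )
    except (FileNotFoundError, PermissionError):
        # The executable is missing from PATH or cannot be run.
        return None
    # Poppler tools usually print version details to stderr; fall back to stdout.
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    print(poppler_version() or "pdfinfo could not be launched from this process")
```

A `None` result corresponds to the same condition that makes `pdf2image` raise `PDFInfoNotInstalledError`.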
If you're still encountering issues, please let me know!
Hey, @dosubot
I have downloaded poppler 23.08.0 and added its bin path to my environment variables, but I am still getting the same error.
directory = '/content/drive/MyDrive/History_QA_dataset'
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader(directory)
documents = loader.load()
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py](https://localhost:8080/#) in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout, first_page, last_page)
580 env["LD_LIBRARY_PATH"] = poppler_path + ":" + env.get("LD_LIBRARY_PATH", "")
--> 581 proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
582
18 frames
[/usr/lib/python3.10/subprocess.py](https://localhost:8080/#) in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, user, group, extra_groups, encoding, errors, text, umask, pipesize)
970
--> 971 self._execute_child(args, executable, preexec_fn, close_fds,
972 pass_fds, cwd, env,
[/usr/lib/python3.10/subprocess.py](https://localhost:8080/#) in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, gid, gids, uid, umask, start_new_session)
1862 err_msg = os.strerror(errno_num)
-> 1863 raise child_exception_type(errno_num, err_msg, err_filename)
1864 raise child_exception_type(err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'
During handling of the above exception, another exception occurred:
PDFInfoNotInstalledError Traceback (most recent call last)
[<ipython-input-19-78f443fb7723>](https://localhost:8080/#) in <cell line: 2>()
1 loader = DirectoryLoader(directory)
----> 2 documents = loader.load()
[/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/directory.py](https://localhost:8080/#) in load(self)
155 else:
156 for i in items:
--> 157 self.load_file(i, p, docs, pbar)
158
159 if pbar:
[/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/directory.py](https://localhost:8080/#) in load_file(self, item, path, docs, pbar)
104 logger.warning(f"Error loading file {str(item)}: {e}")
105 else:
--> 106 raise e
107 finally:
108 if pbar:
[/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/directory.py](https://localhost:8080/#) in load_file(self, item, path, docs, pbar)
98 try:
99 logger.debug(f"Processing file: {str(item)}")
--> 100 sub_docs = self.loader_cls(str(item), **self.loader_kwargs).load()
101 docs.extend(sub_docs)
102 except Exception as e:
[/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/unstructured.py](https://localhost:8080/#) in load(self)
85 def load(self) -> List[Document]:
86 """Load file."""
---> 87 elements = self._get_elements()
88 self._post_process_elements(elements)
89 if self.mode == "elements":
[/usr/local/lib/python3.10/dist-packages/langchain_community/document_loaders/unstructured.py](https://localhost:8080/#) in _get_elements(self)
171 from unstructured.partition.auto import partition
172
--> 173 return partition(filename=self.file_path, **self.unstructured_kwargs)
174
175 def _get_metadata(self) -> dict:
[/usr/local/lib/python3.10/dist-packages/unstructured/partition/auto.py](https://localhost:8080/#) in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, xml_keep_tags, data_source_metadata, metadata_filename, request_timeout, hi_res_model_name, model_name, **kwargs)
407 elif filetype == FileType.PDF:
408 _partition_pdf = _get_partition_with_extras("pdf")
--> 409 elements = _partition_pdf(
410 filename=filename, # type: ignore
411 file=file, # type: ignore
[/usr/local/lib/python3.10/dist-packages/unstructured/documents/elements.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
516 @functools.wraps(func)
517 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 518 elements = func(*args, **kwargs)
519 sig = inspect.signature(func)
520 params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)
[/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
589 @functools.wraps(func)
590 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 591 elements = func(*args, **kwargs)
592 sig = inspect.signature(func)
593 params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)
[/usr/local/lib/python3.10/dist-packages/unstructured/file_utils/filetype.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
544 @functools.wraps(func)
545 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 546 elements = func(*args, **kwargs)
547 sig = inspect.signature(func)
548 params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)
[/usr/local/lib/python3.10/dist-packages/unstructured/chunking/__init__.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
50 @functools.wraps(func)
51 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
---> 52 elements = func(*args, **kwargs)
53 sig = inspect.signature(func)
54 params: Dict[str, Any] = dict(**dict(zip(sig.parameters, args)), **kwargs)
[/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py](https://localhost:8080/#) in partition_pdf(filename, file, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, include_metadata, metadata_filename, metadata_last_modified, chunking_strategy, links, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, **kwargs)
204 languages = check_languages(languages, ocr_languages)
205
--> 206 return partition_pdf_or_image(
207 filename=filename,
208 file=file,
[/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py](https://localhost:8080/#) in partition_pdf_or_image(filename, file, is_image, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, metadata_last_modified, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, **kwargs)
559 # NOTE(robinson): Catches file conversion warnings when running with PDFs
560 with warnings.catch_warnings():
--> 561 elements = _partition_pdf_or_image_with_ocr(
562 filename=filename,
563 file=file,
[/usr/local/lib/python3.10/dist-packages/unstructured/utils.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
212 ),
213 )
--> 214 return func(*args, **kwargs)
215
216 return wrapper
[/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py](https://localhost:8080/#) in _partition_pdf_or_image_with_ocr(filename, file, include_page_breaks, languages, is_image, metadata_last_modified, **kwargs)
899 else:
900 page_number = 0
--> 901 for image in convert_pdf_to_images(filename, file):
902 page_number += 1
903 page_elements = _partition_pdf_or_image_with_ocr_from_image(
[/usr/local/lib/python3.10/dist-packages/unstructured/partition/pdf.py](https://localhost:8080/#) in convert_pdf_to_images(filename, file, chunk_size)
846 else:
847 f_bytes = None
--> 848 info = pdf2image.pdfinfo_from_path(filename)
849
850 total_pages = info["Pages"]
[/usr/local/lib/python3.10/dist-packages/pdf2image/pdf2image.py](https://localhost:8080/#) in pdfinfo_from_path(pdf_path, userpw, ownerpw, poppler_path, rawdates, timeout, first_page, last_page)
605
606 except OSError:
--> 607 raise PDFInfoNotInstalledError(
608 "Unable to get page count. Is poppler installed and in PATH?"
609 )
PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
🤖
Based on the information you've provided and the similar issue I found in the LangChain repository, it seems that the error you're encountering is due to Poppler not being correctly installed or not found in your system's PATH. Poppler is a PDF rendering library. `DirectoryLoader` does not call it directly: as your traceback shows, the call goes through `unstructured` and `pdf2image`, which tries to launch Poppler's `pdfinfo` executable.
The download link you used for Poppler might not have provided the correct binaries for your system. I recommend downloading the Poppler binaries from this link: Poppler Windows Releases. After downloading, you need to extract the files and add the `bin` directory to your system's PATH.
Here's how you can add the directory to your PATH:
1. Press `Win + X` and choose `System`.
2. Click `Advanced system settings`.
3. In the `System Properties` window that opens, click the `Environment Variables` button.
4. In the `Environment Variables` window, under `System variables`, find and select the `Path` variable, then click `Edit`.
5. In the `Edit Environment Variable` window, click `New` and add the path to the `bin` directory of the extracted Poppler files.
Please note that you need to restart your command prompt or IDE for the changes to take effect. If you're still encountering issues after this, please let me know.
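The traceback earlier in the thread shows POSIX-style paths (`/usr/local/lib/python3.10/dist-packages/...` and `/content/drive/MyDrive/...`), which suggests the failing code may be running on a Linux host rather than on the Windows machine whose PATH was edited. If that is the case, Poppler would need to be installed in that environment instead. The sketch below checks where Python is actually running; the install commands are common package names and should be treated as assumptions to verify for your platform, and `poppler_install_hint` is a hypothetical helper.

```python
import platform


def poppler_install_hint(system=None):
    """Return a rough install suggestion for the given (or detected) operating system."""
    system = system or platform.system()
    hints = {
        "Linux": "apt-get install -y poppler-utils  (Debian/Ubuntu; other distros differ)",
        "Darwin": "brew install poppler",
        "Windows": "download Poppler binaries, extract them, and add the bin directory to PATH",
    }
    fallback = f"no hint for {system!r}; install Poppler with your platform's package manager"
    return hints.get(system, fallback)


if __name__ == "__main__":
    print(platform.system(), "->", poppler_install_hint())
```

If this prints `Linux` in the session that raises the error, changes to a Windows PATH will not affect it.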
Getting same error here. What was the solution? Thanks.
System Info
Windows: 11, pip == 23.3.1, python == 3.10.10, langchain == 0.1.0, transformers == 4.36.2, sentence_transformers == 2.2.2, unstructured == 0.12.0