langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Unable to get page count. Is poppler installed and in PATH #13838

Closed akashAD98 closed 6 months ago

akashAD98 commented 10 months ago

System Info

I'm running it on Google Colab.

Who can help?

I'm trying the multi-modal RAG example. I have tried everything, but I still get this error. Please tell me if there is an alternative approach, or how we can install poppler. @bas

Information

Related Components

Reproduction

Run the following notebook on Colab; the partition step fails and returns no output:

https://github.com/langchain-ai/langchain/blob/master/cookbook/Multi_modal_RAG.ipynb


Expected behavior

It should run on Colab without errors.

dosubot[bot] commented 10 months ago

🤖

Hello @akashAD98, it's good to see you again! I hope you're doing well.

To install poppler on Google Colab, you can use the following command in a cell:

!apt-get install -y poppler-utils 

This should install the poppler utilities, which are required for the operation you are trying to perform.
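To confirm the install actually put the tools on PATH, a quick check like the one below may help. It is only a sketch: the helper name `find_poppler_tools` is made up for illustration, and it assumes the page count is read through poppler's `pdfinfo` binary and pages are rendered with `pdftoppm`, which is how pdf2image typically works.

```python
import shutil


def find_poppler_tools(tools=("pdfinfo", "pdftoppm")):
    """Map each tool name to its full path, or None if it is not on PATH.

    Hypothetical helper; the tool names are the poppler-utils binaries
    assumed to be needed for counting and rendering PDF pages.
    """
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_poppler_tools().items():
        status = path if path else "NOT FOUND (is poppler-utils installed?)"
        print(f"{tool}: {status}")
```

If either tool shows as not found after the install, restarting the Colab runtime and rerunning the install cell is worth trying before anything else.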

However, if you still encounter the "Unable to get page count" error after installing it, poppler may be unable to read the PDF file itself. Common causes are a PDF that is encrypted, corrupted, or only partially downloaded.
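A rough way to rule out a truncated or mis-downloaded file is to look for the PDF header and end-of-file marker. This is a heuristic sketch only: the function names are hypothetical, the 1024-byte windows are an arbitrary choice, and passing the check does not prove the file is valid or unencrypted.

```python
def looks_like_complete_pdf(data: bytes) -> bool:
    """Heuristic: a PDF normally starts with '%PDF-' and ends with '%%EOF'."""
    has_header = b"%PDF-" in data[:1024]    # header near the start
    has_trailer = b"%%EOF" in data[-1024:]  # end marker near the end
    return has_header and has_trailer


def check_pdf_file(pdf_path: str) -> bool:
    """Read the file at pdf_path and apply the heuristic above."""
    with open(pdf_path, "rb") as file:
        return looks_like_complete_pdf(file.read())
```

A download that saved an HTML error page instead of the PDF, for example, would fail the header check.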

As an alternative solution, you can use the PyPDF2 library to read the PDF file. Here is an example of how you can use it:

!pip install PyPDF2

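import PyPDF2

def get_num_pages(pdf_path):
    # PdfFileReader and getNumPages() were removed in PyPDF2 3.0;
    # use PdfReader and the length of its pages list instead.
    with open(pdf_path, "rb") as file:
        pdf = PyPDF2.PdfReader(file)
        return len(pdf.pages)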

pdf_path = "/path/to/your/pdf"
print(get_num_pages(pdf_path))

This will return the number of pages in the PDF file. If PyPDF2 is also not able to read the PDF file, then the issue is likely with the PDF file itself and not with the library you are using to read it.

As for the specific role of poppler in the LangChain codebase, particularly in the multi-modal RAG example, I wasn't able to find an answer within the repository. It's possible that poppler is being used to process PDF files in some way, but without more information, it's hard to say for sure. If you could provide more details about the error message you're seeing, it might help to narrow down the problem.


kulaone commented 9 months ago

I had the same problem with an XFA PDF. After installing poppler-utils I got a new error: UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7bb25f8173d0>

terwayp commented 3 months ago

I am getting the same error too. Any updates on this? UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x452315ee0>