Issue: PyPDF extract_images=True is not working on Docker (production)

akash97715 commented 8 months ago

Issue you'd like to raise.

Hello Team,

We are trying to use pypdf to extract the text from a PDF and use the chunks for embedding (details are in the attached code snippet). I have installed all the required packages, and it works fine on my local machine (Windows 10). When I use the same code snippet and requirement.txt in Docker running Ubuntu (our production environment), I get the error below:

Error while chunking the file: Error while chunking the file, Errored while loading the document: rapidocr-onnxruntime package not found, please install it with `pip install rapidocr-onnxruntime`

The strange part is that the rapidocr-onnxruntime package is already installed on the Ubuntu system (I re-verified this in the GitHub Actions runner logs, where all the packages from requirement.txt are installed).

I cannot understand why, in production, pypdf with extract_images=True throws the above error.

It would be helpful if you could provide any insight or a workaround for this issue. process_pdff

Suggestion:

No response

dosubot[bot] commented 8 months ago

🤖

Hey there @akash97715! Hope you're doing well. It's good to see you again, let's dive into this issue together.

The error you're encountering seems to be originating from the extract_from_images_with_rapidocr function in the pdf.py file. This function attempts to import the rapidocr_onnxruntime package and raises an ImportError if it fails. The rapidocr_onnxruntime package is used to extract text from images when extract_images is set to True.
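
One consequence of the pattern described above is that any ImportError raised while importing the package, including one raised by a dependency, is reported as "package not found". The following is a minimal sketch of that behavior, not the library's actual code; `import_masking_cause`, `import_preserving_cause`, `make_broken_module`, and the module name `fake_ocr_pkg` are hypothetical names introduced for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_masking_cause(module_name: str):
    """Mimic the described pattern: every ImportError becomes 'package not found'."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise ImportError(
            f"{module_name} package not found, please install it with pip"
        )


def import_preserving_cause(module_name: str):
    """Same check, but the original error text is kept in the new message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(f"could not import {module_name}: {exc}") from exc


def make_broken_module(name: str = "fake_ocr_pkg") -> str:
    """Create a module that is present on sys.path but fails during import.

    This is a stand-in (an assumption for the demo) for a package whose
    native dependency is missing.
    """
    tmp_dir = Path(tempfile.mkdtemp())
    (tmp_dir / f"{name}.py").write_text(
        "raise ImportError('libGL.so.1: cannot open shared object file')\n"
    )
    sys.path.insert(0, str(tmp_dir))
    importlib.invalidate_caches()
    return name


if __name__ == "__main__":
    broken = make_broken_module()
    for loader in (import_masking_cause, import_preserving_cause):
        try:
            loader(broken)
        except ImportError as err:
            print(f"{loader.__name__}: {err}")
```

In the first variant the message alone gives no hint that the module exists and failed for another reason, which is why running the import directly in the container can be more informative than the logged error.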

Even though the GitHub action runner logs show that the rapidocr_onnxruntime package is installed, the error could be due to several reasons:

1. The package might not be installed in the same Python environment that the application is running in. This can happen if there are multiple Python environments on the system.
2. The package might not be installed correctly. This can happen if there were errors during the installation process.
3. The package might not be compatible with the system. This can happen if the package requires specific system libraries that are not installed or if the package is not compatible with the version of Python that the application is using.

To resolve this issue, you can try the following:

1. Ensure that the rapidocr_onnxruntime package is installed in the same Python environment that the application is running in. You can do this by running pip show rapidocr-onnxruntime in the same environment.
2. Try reinstalling the rapidocr-onnxruntime package. You can do this by running pip uninstall rapidocr-onnxruntime followed by pip install rapidocr-onnxruntime.
3. Check the compatibility of the rapidocr-onnxruntime package with your system and Python version. You can do this by checking the package documentation or the package page on PyPI.
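
The checks above can also be run from inside the application's own interpreter, which avoids any ambiguity about which environment pip is reporting on. The sketch below is one possible way to do that using only the standard library; the helper name `diagnose` is hypothetical, and the script assumes it is executed with the same Python that runs the application (for example via `docker exec`).

```python
import importlib
import sys
from importlib import metadata


def diagnose(dist_name: str, module_name: str) -> dict:
    """Report the running interpreter, the installed version of `dist_name`
    (None if this environment has no metadata for it), and the error raised
    when importing `module_name` (None if the import succeeds)."""
    report = {"interpreter": sys.executable, "version": None, "import_error": None}
    try:
        report["version"] = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        pass  # the distribution is not installed in this environment
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # ImportError, or OSError from a native library
        report["import_error"] = f"{type(exc).__name__}: {exc}"
    return report


if __name__ == "__main__":
    result = diagnose("rapidocr-onnxruntime", "rapidocr_onnxruntime")
    for key, value in result.items():
        print(f"{key}: {value}")
```

A version together with a non-empty import_error suggests the package is installed but cannot be loaded, whereas a missing version points to an environment mismatch.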

If none of these solutions work, you might need to look into other potential issues with your system or application configuration.

Also, it would be helpful if you could provide the exact versions of the rapidocr-onnxruntime and pypdf packages that are being used in the Docker production environment. This information could help in diagnosing the issue.

You can find more details about the extract_from_images_with_rapidocr function in the pdf.py file listed under Sources below.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

-Dosu

Sources

libs/community/langchain_community/document_loaders/parsers/pdf.py

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

BYZANTINE26 commented 7 months ago

I am facing the same issue. @akash97715, were you able to resolve it?

BYZANTINE26 commented 7 months ago

>>> from rapidocr_onnxruntime import RapidOCR
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/rapidocr_onnxruntime/__init__.py", line 4, in <module>
    from .main import RapidOCR
  File "/usr/local/lib/python3.9/site-packages/rapidocr_onnxruntime/main.py", line 8, in <module>
    import cv2
  File "/usr/local/lib/python3.9/site-packages/cv2/__init__.py", line 181, in <module>
    bootstrap()
  File "/usr/local/lib/python3.9/site-packages/cv2/__init__.py", line 153, in bootstrap
    native_module = importlib.import_module("cv2")
  File "/usr/local/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: libGL.so.1: cannot open shared object file: No such file or directory

I found OpenCV is trying to use OpenGL (libGL.so.1), but it's failing to find the necessary OpenGL library.

However, I am unable to install the OpenGL library in my Docker container.
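
To confirm whether the dynamic loader can find libGL.so.1 without going through OpenCV at all, a small check with the standard-library ctypes module may help. This is a sketch under the assumption that it runs inside the container in question; the helper name `shared_library_loads` is hypothetical.

```python
import ctypes


def shared_library_loads(soname: str) -> bool:
    """Return True if the dynamic loader can open `soname`, False otherwise."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the library file or one of its dependencies is not found.
        return False
    return True


if __name__ == "__main__":
    soname = "libGL.so.1"
    status = "found" if shared_library_loads(soname) else "MISSING"
    print(f"{soname}: {status}")
```

If the library is reported as missing, the fix lies in the image's system packages rather than in the Python requirements.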

MaisieHu8891 commented 6 months ago

I am facing the same issue. The error occurs only in Docker; on my Mac it works correctly.

I'm sure the rapidocr-onnxruntime package is installed in the Docker image.

self.loader = PyMuPDFLoader(self.file_path, extract_images=True)

024-03-15 13:43:42,811 - root - ERROR - rapidocr-onnxruntime package not found, please install it with pip install rapidocr-onnxruntime

Please help me resolve it.

MaisieHu8891 commented 6 months ago

On a CentOS 6 Docker image, `sudo yum install mesa-libGL` resolves it.

CIBIO-TropiBIO commented 5 months ago

same issue

bhargav-xponai commented 4 months ago

On a Debian/Ubuntu-based Linux image, add this to your Dockerfile:

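# Install libGL.so.1 (required by OpenCV) and GLib.
# On newer Debian/Ubuntu releases where libgl1-mesa-glx is unavailable, the libgl1 package may be needed instead.
RUN apt-get update && apt-get install -y --no-install-recommends \
    libgl1-mesa-glx \
    libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*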

box9527 commented 3 months ago

@bhargav-xponai's solution works perfectly for me. Thanks for saving my day.

langchain-ai / langchain