jalan / pdftotext

Simple PDF text extraction
MIT License
870 stars 99 forks

Unable to install pdftotext : poppler/cpp/poppler-document.h not found #102

Closed: yashali closed this issue 2 years ago

yashali commented 2 years ago

Hi, I am trying to install pdftotext in my RHEL 7-based Docker image.

Here's the error I am running into while running conda env update ... :

Pip subprocess error:
  ERROR: Command errored out with exit status 1:
   command: /opt/conda/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-jrjnkq4a/pdftotext_ab40464f36954c34afc20d6c47589000/setup.py'"'"'; __file__='"'"'/tmp/pip-install-jrjnkq4a/pdftotext_ab40464f36954c34afc20d6c47589000/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-l25y3g5j
       cwd: /tmp/pip-install-jrjnkq4a/pdftotext_ab40464f36954c34afc20d6c47589000/
  Complete output (13 lines):
  running bdist_wheel
  running build
  running build_ext
  building 'pdftotext' extension
  creating build
  creating build/temp.linux-x86_64-3.8
  gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DPOPPLER_CPP_AT_LEAST_0_30_0=0 -I/opt/conda/include/python3.8 -c pdftotext.cpp -o build/temp.linux-x86_64-3.8/pdftotext.o -Wall
  cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
  pdftotext.cpp:3:42: fatal error: poppler/cpp/poppler-document.h: No such file or directory
   #include <poppler/cpp/poppler-document.h>
                                            ^
  compilation terminated.

Here's my Dockerfile

FROM org-acr.azurecr.io/base/org-custom-rhel7:latest

RUN umask 0003 && \
    /bin/yum makecache fast && \
    /bin/yum install -y bzip2 wget gcc gcc-c++ libaio mesa-libGL libXt which make vim zip unzip tcsh gsl-devel libgdal-dev libspatialindex-dev && \
    yum install -y gettext-devel openssl-devel perl-CPAN perl-devel zlib-devel curl-devel && \
    /bin/yum install -y gtk3 alsa-lib libXScrnSaver poppler-utils poppler-cpp poppler-cpp-devel  inkscape 

### Add the miniconda distribution from https://repo.continuum.io/miniconda/Miniconda3-4.6.14-Linux-x86_64.sh

# COPY conda and pip config files

USER jovyan

RUN umask 0003 && \
    conda install -n base -c defaults conda=4.9.0 && \
    conda env update --name base -f ~/env.yaml && \
    conda clean -y -a && \
    rm -rf ~/.cache/pip && \
    source activate base && \
    conda env export -n base --no-builds && \
    python -m pipdeptree --all --warn fail
... ... ...

Some of the packages in my env.yaml are:

name: base
channels:
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1
  - cairo=1.16.0
  - libgcc-ng=9.3.0
  - pdf2svg=0.2.3
  - pip=20.3.3
  - poppler=0.89.0
  - poppler-data=0.4.11
  - python=3.8.6
  - pip:
    - pdftotext

During the Docker build, yum logs "No package poppler-cpp-devel available." and "No package poppler-cpp available."
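Since yum reports that the poppler-cpp packages are unavailable, one way to confirm whether the header named in the compiler error exists anywhere in the image is a small check like the one below. This is only a sketch: the function names and the list of directories are illustrative assumptions, not part of pdftotext or poppler, and the directories should be adjusted to the image.

```python
import os
from pathlib import Path

# Header named in the "fatal error" line of the build log.
HEADER = Path("poppler/cpp/poppler-document.h")


def find_header(include_dirs, header=HEADER):
    """Return the first existing path to `header` under `include_dirs`, else None."""
    for directory in include_dirs:
        candidate = Path(directory) / header
        if candidate.is_file():
            return candidate
    return None


def default_include_dirs():
    """Assumed search locations (hypothetical defaults; adjust for the image)."""
    dirs = ["/usr/include", "/usr/local/include"]
    conda_prefix = os.environ.get("CONDA_PREFIX")
    if conda_prefix:
        dirs.append(os.path.join(conda_prefix, "include"))
    return dirs


if __name__ == "__main__":
    found = find_header(default_include_dirs())
    print(found if found else "poppler-document.h not found in the checked directories")
```

Even if the header turns up under a conda prefix, the compiler may still not search there: the gcc command in the log above passes only -I/opt/conda/include/python3.8.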

Please help me understand how to get rid of this poppler error. I tried to install poppler from source with the code below, but I get an SSL handshake error with poppler.freedesktop.org:

RUN /bin/yum install freetype-devel -y \
    && wget https://poppler.freedesktop.org/poppler-data-0.4.11.tar.gz --no-check-certificate -d \
    && tar -xf poppler-data-0.4.11.tar.gz && cd poppler-data-0.4.11 \
    && make install && cd .. \
    && wget https://poppler.freedesktop.org/poppler-22.07.0.tar.xz --no-check-certificate \
    && tar -xf poppler-22.07.0.tar.xz && cd poppler-22.07.0 \
    && mkdir build && cd build \
    && cmake .. && make && make install \
    && ldconfig && cd ../.. \
    && rm poppler-data-0.4.11.tar.gz && rm -rf poppler-data-0.4.11 \
    && rm poppler-22.07.0.tar.xz && rm -rf poppler-22.07.0

jalan commented 2 years ago

It looks like you're using conda. Somebody put this package on conda-forge (https://anaconda.org/conda-forge/pdftotext), so you can just use that.

yashali commented 2 years ago

I get the same error with conda/pdftotext :(

jalan commented 2 years ago

> I get the same error with conda/pdftotext :(

No, you don't. You don't get a compiler error if you try to install pdftotext from conda-forge, because there isn't any compilation happening. For example:

$ docker run -it continuumio/miniconda3:4.12.0 bash
(base) root@aabecf517e7a:/# conda install --channel conda-forge --quiet --yes pdftotext
[output snipped]
(base) root@aabecf517e7a:/# conda list | grep pdftotext
pdftotext                 2.2.2            py39h0cd543a_0    conda-forge
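
After a conda-forge install like the one above, a quick way to confirm that the module is visible to the interpreter, without importing it or compiling anything, is to ask Python whether it can locate it. This is a minimal sketch; the helper name is_importable is made up for illustration.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate `module_name` in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("pdftotext importable:", is_importable("pdftotext"))
```

Run it with the same interpreter that conda installed into (here, the base environment), otherwise the result reflects a different environment.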

yashali commented 2 years ago

Oh, my bad. Let me try to get a version (from conda) that doesn't conflict with the other conda package versions in my env.yaml. That conflict is the error I get with conda.