izderadicka / pdfparser

Python binding to libpoppler with focus on text extraction

macOS install seems broken. #29

Closed by jaihindhreddy 3 years ago

jaihindhreddy commented 4 years ago

I'm unable to install the package using the instructions.

I'm running macOS Mojave 10.14.6 (18G4032).

I installed poppler via Homebrew (brew install poppler), which gave me version 0.87.0. I tried both Python 2.7.17 and 3.7.7.

clang --version output:

Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

pip3 install git+https://github.com/izderadicka/pdfparser fails with the following:

Jaihindhs-MacBook-Pro:~ jaihindhreddy$ pip3 install git+https://github.com/izderadicka/pdfparser
Collecting git+https://github.com/izderadicka/pdfparser
  Cloning https://github.com/izderadicka/pdfparser to /private/var/folders/b_/vjvtnssn73dgczf8x0gfyldm0000gn/T/pip-req-build-nncuvxjr
  Running command git clone -q https://github.com/izderadicka/pdfparser /private/var/folders/b_/vjvtnssn73dgczf8x0gfyldm0000gn/T/pip-req-build-nncuvxjr
Requirement already satisfied: cython in /usr/local/lib/python3.7/site-packages (from pdfparser==0.1.3) (0.29.13)
Building wheels for collected packages: pdfparser
  Building wheel for pdfparser (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /usr/local/opt/python/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/b_/vjvtnssn73dgczf8x0gfyldm0000gn/T/pip-req-build-nncuvxjr/setup.py'"'"'; __file__='"'"'/private/var/folders/b_/vjvtnssn73dgczf8x0gfyldm0000gn/T/pip-req-build-nncuvxjr/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /private/var/folders/b_/vjvtnssn73dgczf8x0gfyldm0000gn/T/pip-wheel-5__05z25
       cwd: /private/var/folders/b_/vjvtnssn73dgczf8x0gfyldm0000gn/T/pip-req-build-nncuvxjr/
  Complete output (84 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.macosx-10.14-x86_64-3.7
  creating build/lib.macosx-10.14-x86_64-3.7/pdfparser
  copying pdfparser/__init__.py -> build/lib.macosx-10.14-x86_64-3.7/pdfparser
  running egg_info
  creating pdfparser.egg-info
  writing pdfparser.egg-info/PKG-INFO
  writing dependency_links to pdfparser.egg-info/dependency_links.txt
  writing requirements to pdfparser.egg-info/requires.txt
  writing top-level names to pdfparser.egg-info/top_level.txt
  writing manifest file 'pdfparser.egg-info/SOURCES.txt'
  reading manifest file 'pdfparser.egg-info/SOURCES.txt'
  writing manifest file 'pdfparser.egg-info/SOURCES.txt'
  copying pdfparser/poppler.pyx -> build/lib.macosx-10.14-x86_64-3.7/pdfparser
  running build_ext
  cythoning pdfparser/poppler.pyx to pdfparser/poppler.cpp
  /usr/local/lib/python3.7/site-packages/Cython/Compiler/Main.py:369: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: /private/var/folders/b_/vjvtnssn73dgczf8x0gfyldm0000gn/T/pip-req-build-nncuvxjr/pdfparser/poppler.pyx
    tree = Parsing.p_module(s, pxd, full_module_name)
  building 'pdfparser.poppler' extension
  creating build/temp.macosx-10.14-x86_64-3.7
  creating build/temp.macosx-10.14-x86_64-3.7/pdfparser
  clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk -I/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include -I/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -I/usr/local/opt/openssl/include -I/usr/local/Cellar/poppler/0.87.0/include/poppler -I/usr/local/Cellar/poppler/0.87.0/include/poppler/cpp -I/usr/local/Cellar/poppler/0.87.0/include/poppler -I/usr/local/include -I/usr/local/opt/openssl@1.1/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/include/python3.7m -c pdfparser/poppler.cpp -o build/temp.macosx-10.14-x86_64-3.7/pdfparser/poppler.o -std=c++11 -stdlib=libc++ -mmacosx-version-min=10.7
  pdfparser/poppler.cpp:812:11: warning: 'likely' macro redefined [-Wmacro-redefined]
    #define likely(x)   __builtin_expect(!!(x), 1)
            ^
  /usr/local/Cellar/poppler/0.87.0/include/poppler/goo/GooLikely.h:15:10: note: previous definition is here
  # define likely(x)      __builtin_expect((x), 1)
           ^
  pdfparser/poppler.cpp:813:11: warning: 'unlikely' macro redefined [-Wmacro-redefined]
    #define unlikely(x) __builtin_expect(!!(x), 0)
            ^
  /usr/local/Cellar/poppler/0.87.0/include/poppler/goo/GooLikely.h:16:10: note: previous definition is here
  # define unlikely(x)    __builtin_expect((x), 0)
           ^
  pdfparser/poppler.cpp:2942:49: error: assigning to 'TextFlow *' from incompatible type 'const TextFlow *'
    __pyx_v_self->curr_flow = __pyx_v_self->page->getFlows();
                              ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~
  pdfparser/poppler.cpp:3173:54: error: assigning to 'TextFlow *' from incompatible type 'const TextFlow *'
    __pyx_v_self->curr_flow = __pyx_v_self->curr_flow->getNext();
                              ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
  pdfparser/poppler.cpp:3525:50: error: assigning to 'TextBlock *' from incompatible type 'const TextBlock *'
    __pyx_v_self->curr_block = __pyx_v_self->flow->getBlocks();
                               ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~
  pdfparser/poppler.cpp:3674:56: error: assigning to 'TextBlock *' from incompatible type 'const TextBlock *'
    __pyx_v_self->curr_block = __pyx_v_self->curr_block->getNext();
                               ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
  pdfparser/poppler.cpp:3901:50: error: assigning to 'TextLine *' from incompatible type 'const TextLine *'
    __pyx_v_self->curr_line = __pyx_v_self->block->getLines();
                              ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~
  pdfparser/poppler.cpp:4050:54: error: assigning to 'TextLine *' from incompatible type 'const TextLine *'
    __pyx_v_self->curr_line = __pyx_v_self->curr_line->getNext();
                              ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
  pdfparser/poppler.cpp:8590:35: error: assigning to 'TextWord *' from incompatible type 'const TextWord *'
    __pyx_v_w = __pyx_v_self->line->getWords();
                ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~
  pdfparser/poppler.cpp:8768:38: error: assigning to 'GooString *' from incompatible type 'const GooString *'
        __pyx_v_font_name = __pyx_v_w->getFontName(__pyx_v_i);
                            ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
  pdfparser/poppler.cpp:9184:28: error: assigning to 'TextWord *' from incompatible type 'const TextWord *'
      __pyx_v_w = __pyx_v_w->getNext();
                  ~~~~~~~~~~~^~~~~~~~~
  pdfparser/poppler.cpp:12370:16: error: no viable overloaded '='
    globalParams = __pyx_t_2;
    ~~~~~~~~~~~~ ^ ~~~~~~~~~
  /Library/Developer/CommandLineTools/usr/include/c++/v1/memory:2397:28: note: candidate function (the implicit copy assignment operator) not viable: no known conversion from 'GlobalParams *' to 'const std::__1::unique_ptr<GlobalParams, std::__1::default_delete<GlobalParams> >' for 1st argument
  class _LIBCPP_TEMPLATE_VIS unique_ptr {
                             ^
  /Library/Developer/CommandLineTools/usr/include/c++/v1/memory:2513:15: note: candidate function not viable: no known conversion from 'GlobalParams *' to 'std::__1::unique_ptr<GlobalParams, std::__1::default_delete<GlobalParams> >' for 1st argument
    unique_ptr& operator=(unique_ptr&& __u) _NOEXCEPT {
                ^
  /Library/Developer/CommandLineTools/usr/include/c++/v1/memory:2605:15: note: candidate function not viable: no known conversion from 'GlobalParams *' to 'std::nullptr_t' (aka 'nullptr_t') for 1st argument
    unique_ptr& operator=(nullptr_t) _NOEXCEPT {
                ^
  /Library/Developer/CommandLineTools/usr/include/c++/v1/memory:2524:15: note: candidate template ignored: could not match 'unique_ptr<type-parameter-0-0, type-parameter-0-1>' against 'GlobalParams *'
    unique_ptr& operator=(unique_ptr<_Up, _Ep>&& __u) _NOEXCEPT {
                ^
  /Library/Developer/CommandLineTools/usr/include/c++/v1/memory:2595:7: note: candidate template ignored: could not match 'auto_ptr<type-parameter-0-0>' against 'GlobalParams *'
        operator=(auto_ptr<_Up> __p) {
        ^
  2 warnings and 10 errors generated.
  error: command 'clang' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for pdfparser
  Running setup.py clean for pdfparser
Failed to build pdfparser
Installing collected packages: pdfparser
    Running setup.py install for pdfparser ... error
    ERROR: Command errored out with exit status 1:
     command: /usr/local/opt/python/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/b_/vjvtnssn73dgczf8x0gfyldm0000gn/T/pip-req-build-nncuvxjr/setup.py'"'"'; __file__='"'"'/private/var/folders/b_/vjvtnssn73dgczf8x0gfyldm0000gn/T/pip-req-build-nncuvxjr/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/b_/vjvtnssn73dgczf8x0gfyldm0000gn/T/pip-record-ckf1uu51/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.7m/pdfparser
         cwd: /private/var/folders/b_/vjvtnssn73dgczf8x0gfyldm0000gn/T/pip-req-build-nncuvxjr/
    Complete output (80 lines):
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.14-x86_64-3.7
    creating build/lib.macosx-10.14-x86_64-3.7/pdfparser
    copying pdfparser/__init__.py -> build/lib.macosx-10.14-x86_64-3.7/pdfparser
    running egg_info
    writing pdfparser.egg-info/PKG-INFO
    writing dependency_links to pdfparser.egg-info/dependency_links.txt
    writing requirements to pdfparser.egg-info/requires.txt
    writing top-level names to pdfparser.egg-info/top_level.txt
    reading manifest file 'pdfparser.egg-info/SOURCES.txt'
    writing manifest file 'pdfparser.egg-info/SOURCES.txt'
    copying pdfparser/poppler.pyx -> build/lib.macosx-10.14-x86_64-3.7/pdfparser
    running build_ext
    skipping 'pdfparser/poppler.cpp' Cython extension (up-to-date)
    building 'pdfparser.poppler' extension
    creating build/temp.macosx-10.14-x86_64-3.7
    creating build/temp.macosx-10.14-x86_64-3.7/pdfparser
    clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk -I/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include -I/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -I/usr/local/opt/openssl/include -I/usr/local/Cellar/poppler/0.87.0/include/poppler -I/usr/local/Cellar/poppler/0.87.0/include/poppler/cpp -I/usr/local/Cellar/poppler/0.87.0/include/poppler -I/usr/local/include -I/usr/local/opt/openssl@1.1/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/include/python3.7m -c pdfparser/poppler.cpp -o build/temp.macosx-10.14-x86_64-3.7/pdfparser/poppler.o -std=c++11 -stdlib=libc++ -mmacosx-version-min=10.7
    pdfparser/poppler.cpp:812:11: warning: 'likely' macro redefined [-Wmacro-redefined]
      #define likely(x)   __builtin_expect(!!(x), 1)
              ^
    /usr/local/Cellar/poppler/0.87.0/include/poppler/goo/GooLikely.h:15:10: note: previous definition is here
    # define likely(x)      __builtin_expect((x), 1)
             ^
    pdfparser/poppler.cpp:813:11: warning: 'unlikely' macro redefined [-Wmacro-redefined]
      #define unlikely(x) __builtin_expect(!!(x), 0)
              ^
    /usr/local/Cellar/poppler/0.87.0/include/poppler/goo/GooLikely.h:16:10: note: previous definition is here
    # define unlikely(x)    __builtin_expect((x), 0)
             ^
    pdfparser/poppler.cpp:2942:49: error: assigning to 'TextFlow *' from incompatible type 'const TextFlow *'
      __pyx_v_self->curr_flow = __pyx_v_self->page->getFlows();
                                ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~
    pdfparser/poppler.cpp:3173:54: error: assigning to 'TextFlow *' from incompatible type 'const TextFlow *'
      __pyx_v_self->curr_flow = __pyx_v_self->curr_flow->getNext();
                                ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
    pdfparser/poppler.cpp:3525:50: error: assigning to 'TextBlock *' from incompatible type 'const TextBlock *'
      __pyx_v_self->curr_block = __pyx_v_self->flow->getBlocks();
                                 ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~
    pdfparser/poppler.cpp:3674:56: error: assigning to 'TextBlock *' from incompatible type 'const TextBlock *'
      __pyx_v_self->curr_block = __pyx_v_self->curr_block->getNext();
                                 ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
    pdfparser/poppler.cpp:3901:50: error: assigning to 'TextLine *' from incompatible type 'const TextLine *'
      __pyx_v_self->curr_line = __pyx_v_self->block->getLines();
                                ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~
    pdfparser/poppler.cpp:4050:54: error: assigning to 'TextLine *' from incompatible type 'const TextLine *'
      __pyx_v_self->curr_line = __pyx_v_self->curr_line->getNext();
                                ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
    pdfparser/poppler.cpp:8590:35: error: assigning to 'TextWord *' from incompatible type 'const TextWord *'
      __pyx_v_w = __pyx_v_self->line->getWords();
                  ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~
    pdfparser/poppler.cpp:8768:38: error: assigning to 'GooString *' from incompatible type 'const GooString *'
          __pyx_v_font_name = __pyx_v_w->getFontName(__pyx_v_i);
                              ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
    pdfparser/poppler.cpp:9184:28: error: assigning to 'TextWord *' from incompatible type 'const TextWord *'
        __pyx_v_w = __pyx_v_w->getNext();
                    ~~~~~~~~~~~^~~~~~~~~
    pdfparser/poppler.cpp:12370:16: error: no viable overloaded '='
      globalParams = __pyx_t_2;
      ~~~~~~~~~~~~ ^ ~~~~~~~~~
    /Library/Developer/CommandLineTools/usr/include/c++/v1/memory:2397:28: note: candidate function (the implicit copy assignment operator) not viable: no known conversion from 'GlobalParams *' to 'const std::__1::unique_ptr<GlobalParams, std::__1::default_delete<GlobalParams> >' for 1st argument
    class _LIBCPP_TEMPLATE_VIS unique_ptr {
                               ^
    /Library/Developer/CommandLineTools/usr/include/c++/v1/memory:2513:15: note: candidate function not viable: no known conversion from 'GlobalParams *' to 'std::__1::unique_ptr<GlobalParams, std::__1::default_delete<GlobalParams> >' for 1st argument
      unique_ptr& operator=(unique_ptr&& __u) _NOEXCEPT {
                  ^
    /Library/Developer/CommandLineTools/usr/include/c++/v1/memory:2605:15: note: candidate function not viable: no known conversion from 'GlobalParams *' to 'std::nullptr_t' (aka 'nullptr_t') for 1st argument
      unique_ptr& operator=(nullptr_t) _NOEXCEPT {
                  ^
    /Library/Developer/CommandLineTools/usr/include/c++/v1/memory:2524:15: note: candidate template ignored: could not match 'unique_ptr<type-parameter-0-0, type-parameter-0-1>' against 'GlobalParams *'
      unique_ptr& operator=(unique_ptr<_Up, _Ep>&& __u) _NOEXCEPT {
                  ^
    /Library/Developer/CommandLineTools/usr/include/c++/v1/memory:2595:7: note: candidate template ignored: could not match 'auto_ptr<type-parameter-0-0>' against 'GlobalParams *'
          operator=(auto_ptr<_Up> __p) {
          ^
    2 warnings and 10 errors generated.
    error: command 'clang' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /usr/local/opt/python/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/b_/vjvtnssn73dgczf8x0gfyldm0000gn/T/pip-req-build-nncuvxjr/setup.py'"'"'; __file__='"'"'/private/var/folders/b_/vjvtnssn73dgczf8x0gfyldm0000gn/T/pip-req-build-nncuvxjr/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/b_/vjvtnssn73dgczf8x0gfyldm0000gn/T/pip-record-ckf1uu51/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.7m/pdfparser Check the logs for full command output.

Am I doing something wrong here?

bzamecnik commented 3 years ago

Check out this fork, which works on macOS: https://github.com/rossumai/pdfparser (it includes installation instructions and brew formulas for poppler).

In any case, pdfparser uses a deprecated internal poppler API (xpdf & cairo). A better alternative is python-poppler, which uses the C++ API and is much faster for both image rendering and text extraction: https://pypi.org/project/python-poppler/
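
For anyone wondering what the compiler errors in the log above mean: they fall into two groups. Nine of them are const-correctness errors (the poppler 0.87.0 getters return const pointers, while the generated code stores them in non-const ones), and the last one is a raw pointer being assigned to globalParams, which the candidate notes in the log show to be a std::unique_ptr<GlobalParams>. Below is a minimal sketch of both patterns and how they are usually resolved. The TextWord and GlobalParams types here are simplified stand-ins written for illustration, not the real poppler classes, and count_words and init_global_params are hypothetical helper names.

```cpp
#include <cassert>
#include <memory>

// Simplified stand-in for a poppler text node (NOT the real class).
// The getter returns a const pointer, as the errors in the log imply.
struct TextWord {
    const TextWord *next = nullptr;
    const TextWord *getNext() const { return next; }
};

// Simplified stand-in for poppler's GlobalParams (NOT the real class).
struct GlobalParams {};

// The log's candidate notes show the global is a unique_ptr, not a raw pointer.
std::unique_ptr<GlobalParams> globalParams;

// Hypothetical helper: walks the linked list of words.
// Declaring the cursor as 'const TextWord *' is what avoids
// "assigning to 'TextWord *' from incompatible type 'const TextWord *'".
int count_words(const TextWord *first) {
    int n = 0;
    for (const TextWord *w = first; w != nullptr; w = w->getNext()) {
        ++n;
    }
    return n;
}

// Hypothetical helper: initializes the global.
// 'globalParams = new GlobalParams();' fails with "no viable overloaded '='";
// reset() hands ownership of the raw pointer to the unique_ptr instead.
// (reset is used rather than std::make_unique because the build passes -std=c++11.)
void init_global_params() {
    globalParams.reset(new GlobalParams());
    assert(globalParams != nullptr);
}
```

In the actual binding the equivalent change would have to be made in the Cython declarations so that the generated C++ uses const pointers, which is presumably what the forks mentioned above take care of.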