izderadicka / pdfparser

Python binding to libpoppler with focus on text extraction

Compiling pdfparser with poppler 22.10.0 #33

Open GideonK opened 1 year ago

GideonK commented 1 year ago

I am trying to install a tool that makes use of a version of pdfparser from some years ago. The developer has advised me to install the newest version and adapt it accordingly. I compiled the newest poppler library, 22.10.0 (with some difficulty), and followed the steps in build_poppler.sh, except that I did not clone the specific pinned version of poppler (0.62.0). I then ran cmake and make, copied the relevant .so files, and ran python setup.py install.

It seems that the file structure pdfparser expects from poppler has changed since pdfparser's latest release, so I had to edit many header files. That part now seems to be fixed, but the build still produces warnings and errors from the code itself. I would like to know whether a relatively quick fix is possible, whether you know of any forks that address this, or whether I have to give up on this for now. The full output follows. I'm working on Pop!_OS 22.04 LTS, which is Ubuntu-based. Thanks for your consideration.

$ python setup.py install
running install
/usr/lib/python3/dist-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/usr/lib/python3/dist-packages/setuptools/command/easy_install.py:158: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
  warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: -VERSION- is an invalid version and will not be supported in a future release
  warnings.warn(
running bdist_egg
running egg_info
writing pdfparser.egg-info/PKG-INFO
writing dependency_links to pdfparser.egg-info/dependency_links.txt
writing requirements to pdfparser.egg-info/requires.txt
writing top-level names to pdfparser.egg-info/top_level.txt
reading manifest file 'pdfparser.egg-info/SOURCES.txt'
writing manifest file 'pdfparser.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
skipping 'pdfparser/poppler.cpp' Cython extension (up-to-date)
building 'pdfparser.poppler' extension
x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/include/poppler -I/usr/local/include/poppler/cpp -I/usr/local/include/poppler -I/home/user/.virtualenvs/ontr_env/include -I/usr/include/python3.10 -c pdfparser/poppler.cpp -o build/temp.linux-x86_64-3.10/pdfparser/poppler.o
pdfparser/poppler.cpp:961: warning: "likely" redefined
  961 |   #define likely(x)   __builtin_expect(!!(x), 1)
      | 
In file included from pdfparser/../poppler_src/poppler/Object.h:44,
                 from pdfparser/../poppler_src/poppler/OutputDev.h:42,
                 from pdfparser/poppler.cpp:769:
pdfparser/../poppler_src/poppler/../goo/GooLikely.h:15: note: this is the location of the previous definition
   15 | #    define likely(x) __builtin_expect((x), 1)
      | 
pdfparser/poppler.cpp:962: warning: "unlikely" redefined
  962 |   #define unlikely(x) __builtin_expect(!!(x), 0)
      | 
In file included from pdfparser/../poppler_src/poppler/Object.h:44,
                 from pdfparser/../poppler_src/poppler/OutputDev.h:42,
                 from pdfparser/poppler.cpp:769:
pdfparser/../poppler_src/poppler/../goo/GooLikely.h:16: note: this is the location of the previous definition
   16 | #    define unlikely(x) __builtin_expect((x), 0)
      | 
In file included from /usr/include/dirent.h:245,
                 from pdfparser/../poppler_src/poppler/../goo/gfile.h:60,
                 from pdfparser/../poppler_src/poppler/Error.h:34,
                 from pdfparser/../poppler_src/poppler/GlobalParams.h:46,
                 from pdfparser/poppler.cpp:767:
pdfparser/poppler.cpp: In function ‘int __pyx_pf_9pdfparser_7poppler_8Document___cinit__(__pyx_obj_9pdfparser_7poppler_Document*, char*, PyLongObject*, double, PyLongObject*)’:
pdfparser/poppler.cpp:2293:79: error: cannot convert ‘long int’ to ‘const std::optional<GooString>&’
 2293 |   __pyx_v_self->_doc = PDFDocFactory().createPDFDoc(GooString(__pyx_v_fname), NULL);
      |                                                                               ^~~~
      |                                                                               |
      |                                                                               long int
In file included from pdfparser/poppler.cpp:772:
pdfparser/../poppler_src/poppler/PDFDocFactory.h:49:96: note:   initializing argument 2 of ‘std::unique_ptr<PDFDoc> PDFDocFactory::createPDFDoc(const GooString&, const std::optional<GooString>&, const std::optional<GooString>&, void*)’
   49 |     std::unique_ptr<PDFDoc> createPDFDoc(const GooString &uri, const std::optional<GooString> &ownerPassword = {}, const std::optional<GooString> &userPassword = {}, void *guiDataA = nullptr);
      |                                                                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
pdfparser/poppler.cpp: In function ‘int __pyx_pf_9pdfparser_7poppler_4Page___cinit__(__pyx_obj_9pdfparser_7poppler_Page*, int, __pyx_obj_9pdfparser_7poppler_Document*)’:
pdfparser/poppler.cpp:3152:57: error: invalid conversion from ‘const TextFlow*’ to ‘TextFlow*’ [-fpermissive]
 3152 |   __pyx_v_self->curr_flow = __pyx_v_self->page->getFlows();
      |                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
      |                                                         |
      |                                                         const TextFlow*
pdfparser/poppler.cpp: In function ‘PyObject* __pyx_pf_9pdfparser_7poppler_4Page_6__next__(__pyx_obj_9pdfparser_7poppler_Page*)’:
pdfparser/poppler.cpp:3386:61: error: invalid conversion from ‘const TextFlow*’ to ‘TextFlow*’ [-fpermissive]
 3386 |   __pyx_v_self->curr_flow = __pyx_v_self->curr_flow->getNext();
      |                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
      |                                                             |
      |                                                             const TextFlow*
pdfparser/poppler.cpp: In function ‘int __pyx_pf_9pdfparser_7poppler_4Flow___cinit__(__pyx_obj_9pdfparser_7poppler_Flow*, __pyx_obj_9pdfparser_7poppler_Page*)’:
pdfparser/poppler.cpp:3753:59: error: invalid conversion from ‘const TextBlock*’ to ‘TextBlock*’ [-fpermissive]
 3753 |   __pyx_v_self->curr_block = __pyx_v_self->flow->getBlocks();
      |                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
      |                                                           |
      |                                                           const TextBlock*
pdfparser/poppler.cpp: In function ‘PyObject* __pyx_pf_9pdfparser_7poppler_4Flow_4__next__(__pyx_obj_9pdfparser_7poppler_Flow*)’:
pdfparser/poppler.cpp:3905:63: error: invalid conversion from ‘const TextBlock*’ to ‘TextBlock*’ [-fpermissive]
 3905 |   __pyx_v_self->curr_block = __pyx_v_self->curr_block->getNext();
      |                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
      |                                                               |
      |                                                               const TextBlock*
pdfparser/poppler.cpp: In function ‘int __pyx_pf_9pdfparser_7poppler_5Block___cinit__(__pyx_obj_9pdfparser_7poppler_Block*, __pyx_obj_9pdfparser_7poppler_Flow*)’:
pdfparser/poppler.cpp:4141:58: error: invalid conversion from ‘const TextLine*’ to ‘TextLine*’ [-fpermissive]
 4141 |   __pyx_v_self->curr_line = __pyx_v_self->block->getLines();
      |                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
      |                                                          |
      |                                                          const TextLine*
pdfparser/poppler.cpp: In function ‘PyObject* __pyx_pf_9pdfparser_7poppler_5Block_4__next__(__pyx_obj_9pdfparser_7poppler_Block*)’:
pdfparser/poppler.cpp:4293:61: error: invalid conversion from ‘const TextLine*’ to ‘TextLine*’ [-fpermissive]
 4293 |   __pyx_v_self->curr_line = __pyx_v_self->curr_line->getNext();
      |                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
      |                                                             |
      |                                                             const TextLine*
pdfparser/poppler.cpp: In function ‘PyObject* __pyx_pf_9pdfparser_7poppler_4Line_4_get_text(__pyx_obj_9pdfparser_7poppler_Line*)’:
pdfparser/poppler.cpp:8989:43: error: invalid conversion from ‘const TextWord*’ to ‘TextWord*’ [-fpermissive]
 8989 |   __pyx_v_w = __pyx_v_self->line->getWords();
      |               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
      |                                           |
      |                                           const TextWord*
pdfparser/poppler.cpp:9167:49: error: invalid conversion from ‘const GooString*’ to ‘GooString*’ [-fpermissive]
 9167 |       __pyx_v_font_name = __pyx_v_w->getFontName(__pyx_v_i);
      |                           ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~
      |                                                 |
      |                                                 const GooString*
pdfparser/poppler.cpp:9186:65: error: ‘class GooString’ has no member named ‘getCString’
 9186 |         __pyx_t_9 = __Pyx_PyBytes_FromString(__pyx_v_font_name->getCString()); if (unlikely(!__pyx_t_9)) __PYX_ERR(0, 489, __pyx_L1_error)
      |                                                                 ^~~~~~~~~~
pdfparser/poppler.cpp:9315:33: error: ‘class GooString’ has no member named ‘getCString’
 9315 |     __pyx_v_s_cstr = __pyx_v_s->getCString();
      |                                 ^~~~~~~~~~
pdfparser/poppler.cpp:9583:35: error: invalid conversion from ‘const TextWord*’ to ‘TextWord*’ [-fpermissive]
 9583 |     __pyx_v_w = __pyx_v_w->getNext();
      |                 ~~~~~~~~~~~~~~~~~~^~
      |                                   |
      |                                   const TextWord*
pdfparser/poppler.cpp: In function ‘int __pyx_pymod_exec_poppler(PyObject*)’:
pdfparser/poppler.cpp:12930:18: error: no match for ‘operator=’ (operand types are ‘std::unique_ptr<GlobalParams>’ and ‘GlobalParams*’)
12930 |   globalParams = __pyx_t_2;
      |                  ^~~~~~~~~
In file included from /usr/include/c++/11/memory:76,
                 from pdfparser/../poppler_src/poppler/UnicodeMap.h:36,
                 from pdfparser/../poppler_src/poppler/GlobalParams.h:45,
                 from pdfparser/poppler.cpp:767:
/usr/include/c++/11/bits/unique_ptr.h:386:9: note: candidate: ‘template<class _Up, class _Ep> typename std::enable_if<std::__and_<std::__and_<std::is_convertible<typename std::unique_ptr<_Up, _Ep>::pointer, typename std::__uniq_ptr_impl<_Tp, _Dp>::pointer>, std::__not_<std::is_array<_Up> > >, std::is_assignable<_T2&, _U2&&> >::value, std::unique_ptr<_Tp, _Dp>&>::type std::unique_ptr<_Tp, _Dp>::operator=(std::unique_ptr<_Up, _Ep>&&) [with _Up = _Up; _Ep = _Ep; _Tp = GlobalParams; _Dp = std::default_delete<GlobalParams>]’
  386 |         operator=(unique_ptr<_Up, _Ep>&& __u) noexcept
      |         ^~~~~~~~
/usr/include/c++/11/bits/unique_ptr.h:386:9: note:   template argument deduction/substitution failed:
pdfparser/poppler.cpp:12930:18: note:   mismatched types ‘std::unique_ptr<_Tp, _Dp>’ and ‘GlobalParams*’
12930 |   globalParams = __pyx_t_2;
      |                  ^~~~~~~~~
In file included from /usr/include/c++/11/memory:76,
                 from pdfparser/../poppler_src/poppler/UnicodeMap.h:36,
                 from pdfparser/../poppler_src/poppler/GlobalParams.h:45,
                 from pdfparser/poppler.cpp:767:
/usr/include/c++/11/bits/unique_ptr.h:371:19: note: candidate: ‘std::unique_ptr<_Tp, _Dp>& std::unique_ptr<_Tp, _Dp>::operator=(std::unique_ptr<_Tp, _Dp>&&) [with _Tp = GlobalParams; _Dp = std::default_delete<GlobalParams>]’
  371 |       unique_ptr& operator=(unique_ptr&&) = default;
      |                   ^~~~~~~~
/usr/include/c++/11/bits/unique_ptr.h:371:29: note:   no known conversion for argument 1 from ‘GlobalParams*’ to ‘std::unique_ptr<GlobalParams>&&’
  371 |       unique_ptr& operator=(unique_ptr&&) = default;
      |                             ^~~~~~~~~~~~
/usr/include/c++/11/bits/unique_ptr.h:395:7: note: candidate: ‘std::unique_ptr<_Tp, _Dp>& std::unique_ptr<_Tp, _Dp>::operator=(std::nullptr_t) [with _Tp = GlobalParams; _Dp = std::default_delete<GlobalParams>; std::nullptr_t = std::nullptr_t]’
  395 |       operator=(nullptr_t) noexcept
      |       ^~~~~~~~
/usr/include/c++/11/bits/unique_ptr.h:395:17: note:   no known conversion for argument 1 from ‘GlobalParams*’ to ‘std::nullptr_t’
  395 |       operator=(nullptr_t) noexcept
      |                 ^~~~~~~~~
error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
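
Reading the log, most of the errors seem to fall into three groups: NULL passed where createPDFDoc now takes a std::optional<GooString>, getters such as getFlows(), getNext() and getFontName() now returning pointers to const, and globalParams now being a std::unique_ptr<GlobalParams> rather than a raw pointer. (The remaining getCString errors look like a renamed or removed GooString method.) Below is a minimal standalone sketch of what each pattern might look like after adapting. It uses stand-in types only: Word, Params, g_params, open_doc, join_words and init_params are hypothetical names invented for illustration, not poppler classes or functions, and it assumes a C++17 compiler.

```cpp
// Stand-in types only: they mimic the *shape* of the API seen in the errors
// above and are NOT the real poppler classes.
#include <cassert>
#include <memory>
#include <optional>
#include <string>

struct Word {                      // hypothetical stand-in for TextWord
    std::string text;
    const Word *next = nullptr;
    const Word *getNext() const { return next; }  // returns pointer-to-const
};

struct Params { int id = 0; };     // hypothetical stand-in for GlobalParams
std::unique_ptr<Params> g_params;  // stand-in for a unique_ptr global

// Pattern 1: an optional argument is omitted or given std::nullopt / {},
// never NULL (NULL is an integer constant and does not convert).
std::string open_doc(const std::string &uri,
                     const std::optional<std::string> &owner_pw = {}) {
    return owner_pw ? uri + ":" + *owner_pw : uri;
}

// Pattern 2: keep the iteration cursor as a pointer-to-const so that the
// result of a const getter can be assigned to it.
std::string join_words(const Word *first) {
    std::string out;
    for (const Word *w = first; w != nullptr; w = w->getNext()) {
        if (!out.empty()) out += ' ';
        out += w->text;
    }
    return out;
}

// Pattern 3: a unique_ptr global is set with make_unique (or reset()),
// not by assigning a raw pointer.
void init_params(int id) {
    g_params = std::make_unique<Params>();
    g_params->id = id;
}
```

Since poppler.cpp is generated by Cython, the corresponding changes would presumably have to be made in the .pyx declarations (for example declaring the cursor members as const pointers) and the file regenerated, rather than patched by hand.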