ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools
Other
518 stars 69 forks source link

v3 cannot read pdf files #105

Closed Napping-Lunar closed 2 years ago

Napping-Lunar commented 2 years ago

I use Fedora Linux so I'll provide the steps to reproduce this issue.

  1. Take a fresh install of Fedora Linux 35 (Workstation Edition) or use a live USB with enough space.
  2. Install R (v4.1.2), R-magick (v2.7.3), R-pdftools (v3.0.1), through dnf: sudo dnf install R R-magick R-pdftools
  3. Run R and try to import a pdf, as demonstrated in the rOpenSci magick intro page:
    library(magick)
    manual <- image_read_pdf('https://cloud.r-project.org/web/packages/magick/magick.pdf', density = 72)
  4. Receive the following error, aborting R:
    /usr/include/c++/11/bits/basic_string.h:1136: std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::reference std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::back() [with _CharT = char; _Traits = std::char_traits<char>; _Alloc = std::allocator<char>; std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::reference = char&]: Assertion '!empty()' failed.

    This also occurs when installing the magick (v2.7.2) and pdftools (v3.0.0) R packages instead from the RStudio repository (requires installing ImageMagick-c++-devel libcurl-devel libjpeg-turbo-devel poppler-cpp-devel via dnf first):

    options(repos = c(REPO_NAME = "https://packagemanager.rstudio.com/all/2021-05-07+Y3JhbiwyOjE3NTA3NTI7OUM2OTYyMkM"))
    install.packages(c("magick", "pdftools"))
    # Select "yes" twice when prompted to create personal library
    manual <- image_read_pdf('https://cloud.r-project.org/web/packages/magick/magick.pdf', density = 72)

    This does not happen with magick v2.7.2 and pdftools v2.3.1 (replace the REPO_NAME with options(repos = c(REPO_NAME = "https://packagemanager.rstudio.com/all/2021-05-03+Y3JhbiwyOjE3NTA3NTI7NkMwNDVBMjE"))).

Unfortunately the error message is too opaque for me to identify the issue further, so any help would be much appreciated!

jeroen commented 2 years ago

It looks like the build of R-pdftools on Fedora 35 is broken. They probably compiled R-pdftools against another version of libpoppler than the one that is currently in F-35.

@Enchufa2 can you have a look? Here is a minimal example:

pdftools::pdf_info('https://cloud.r-project.org/web/packages/magick/magick.pdf')
Enchufa2 commented 2 years ago

Unless I missed something, builds are fine (pinging @QuLogic, the maintainer of this package). Line 1136 of basic_string.h is this:

__glibcxx_assert(!empty());

Fedora enables -D_GLIBCXX_ASSERTIONS to catch this kind of thing, which denotes a bug in the package.

Enchufa2 commented 2 years ago

Here's the bug:

https://github.com/ropensci/pdftools/blob/a3155fb8e2ca5e0ac9b06a83678a09b91606cbb6/src/bindings.cpp#L49

This function shall not be called on empty strings. [ref]

jeroen commented 2 years ago

Here's the bug:

Thanks! So this issue is fixed if we replace that line with if(str.length() && str.back()) ?

Enchufa2 commented 2 years ago

Correct.

jeroen commented 2 years ago

OK I'll add that. Doesn't Fedora run CMD check when building R-pdftools ? This should have triggered in a unit test?

QuLogic commented 2 years ago

It does and it should have, but whoever bumped to 3.0.1 turned them off to bootstrap and never remembered to enable them, or didn't report a bug to either of us.

Enchufa2 commented 2 years ago

According to the build history, spot updated the package, triggered a bootstrap rebuild and a normal one, which failed. He was probably too busy rebuilding R and the whole stack of packages and forgot to look into it.

jeroen commented 2 years ago

OK well thanks for helping me debugging. If you run into any other crashes on Fedora with one of my packages I would love to hear about it of course.

QuLogic commented 2 years ago

Well yes, I was trying to avoid naming names...

In any case, I've submitted an update with the fix backported.

Enchufa2 commented 2 years ago

Well yes, I was trying to avoid naming names...

Not pointing fingers here. Quite the opposite: rebuilding R and the entire stack is a pretty demanding task, as you know, so it's perfectly understandable that something like this slipped through.