ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools

Installation of "pdftools" with poppler 22.06 works in terminal but not in RStudio #115

Closed mbojan closed 2 years ago

mbojan commented 2 years ago

I'm trying to understand the cause of the problem described below. It may be unrelated to pdftools and poppler, but I'd appreciate the expertise.

I have some R scripts that use pdftools, and they need to work on both Windows and Ubuntu 20.04. On Windows, pdftools installs with the most recent version of poppler, but on Ubuntu I'm limited to 0.86.1, the version available in the repositories. I tried installing the latest release from source, following the instructions at https://askubuntu.com/a/1112947/58469 . Everything works in the terminal: pdftools installs correctly, and I can load and use the package. In RStudio, however, loading fails with this error:

> library(pdftools)
Error: package or namespace load failed for ‘pdftools’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/mbojan/R/library/4.2/pdftools/libs/pdftools.so':
  /home/mbojan/R/library/4.2/pdftools/libs/pdftools.so: undefined symbol: _ZNK7poppler8text_box13has_font_infoEv

I have no clue why. Any ideas? Is it RStudio? Is it the environment?

Thanks!

─ Session info ───────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.0 (2022-04-22)
 os       Ubuntu 20.04.4 LTS
 system   x86_64, linux-gnu
 ui       RStudio
 language en_US
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Warsaw
 date     2022-06-14
 rstudio  2022.02.3+492 Prairie Trillium (desktop)
 pandoc   2.18 @ /usr/bin/pandoc

─ Packages ───────────────────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 askpass       1.1     2019-01-13 [1] CRAN (R 4.2.0)
 cli           3.3.0   2022-04-25 [1] CRAN (R 4.2.0)
 crayon        1.5.1   2022-03-26 [1] CRAN (R 4.2.0)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
 fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
 fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
 here          1.0.1   2020-12-13 [1] CRAN (R 4.2.0)
 lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
 pillar        1.7.0   2022-02-01 [1] CRAN (R 4.2.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
 purrr         0.3.4   2020-04-17 [1] CRAN (R 4.2.0)
 qpdf          1.2.0   2022-05-29 [1] CRAN (R 4.2.0)
 Rcpp          1.0.8.3 2022-03-17 [1] CRAN (R 4.2.0)
 rlang         1.0.2   2022-03-04 [1] CRAN (R 4.2.0)
 rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.2.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
 tibble        3.1.7   2022-05-03 [1] CRAN (R 4.2.0)
 usethis       2.1.6   2022-05-25 [1] CRAN (R 4.2.0)
 utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
 vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.2.0)
 xml2          1.3.3   2021-11-30 [1] CRAN (R 4.2.0)

 [1] /home/mbojan/R/library/4.2
 [2] /usr/local/lib/R/site-library
 [3] /usr/lib/R/site-library
 [4] /usr/lib/R/library

──────────────────────────────────────────────────────────────────────────────────────────
jeroen commented 2 years ago

This is probably an issue with your R_LD_LIBRARY_PATH or LD_LIBRARY_PATH. Why do you want to build your own poppler instead of using the stock version from Ubuntu?
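
For readers who want to check this, here is a minimal sketch, assuming a POSIX shell with `ldd` and, optionally, `c++filt` from binutils. The helper names `list_paths`, `demangle`, and `poppler_deps` are hypothetical, and the `pdftools.so` path is simply the one from the error message above. The idea is to run it once in a plain terminal and once from within the RStudio session (for example through R's `system()`) and compare the output.

```shell
#!/bin/sh
# Print each entry of a colon-separated path list on its own line.
list_paths() {
  printf '%s\n' "$1" | tr ':' '\n' | sed '/^$/d'
}

# Demangle a C++ symbol if c++filt is installed; otherwise echo it unchanged.
demangle() {
  if command -v c++filt >/dev/null 2>&1; then
    printf '%s\n' "$1" | c++filt
  else
    printf '%s\n' "$1"
  fi
}

# Show which poppler libraries a shared object resolves to at load time.
poppler_deps() {
  if [ -f "$1" ]; then
    ldd "$1" | grep -i poppler
  else
    echo "not found: $1" >&2
    return 1
  fi
}

echo "LD_LIBRARY_PATH entries:";   list_paths "${LD_LIBRARY_PATH:-}"
echo "R_LD_LIBRARY_PATH entries:"; list_paths "${R_LD_LIBRARY_PATH:-}"
demangle _ZNK7poppler8text_box13has_font_infoEv
# Path taken from the error message above; adjust it to your own library location.
poppler_deps /home/mbojan/R/library/4.2/pdftools/libs/pdftools.so || true
```

With GNU `c++filt`, the symbol demangles to `poppler::text_box::has_font_info() const`. If the two environments resolve `libpoppler-cpp` to different files, that would suggest the failing session is loading a poppler build that lacks this method, such as the older system one.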

mbojan commented 2 years ago

The reason is that pdf_text() seemed to give me different results for the same PDF on Windows and on Ubuntu, so I blamed the different versions of poppler.
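
One way to confirm which poppler version each machine actually has is sketched below. It assumes `pkg-config` is installed and that the development package registers itself as `poppler-cpp`; the function name `poppler_cpp_version` is hypothetical.

```shell
#!/bin/sh
# Report the poppler-cpp version that pkg-config finds,
# or "unknown" when pkg-config or the poppler-cpp metadata is missing.
poppler_cpp_version() {
  if command -v pkg-config >/dev/null 2>&1 && pkg-config --exists poppler-cpp; then
    pkg-config --modversion poppler-cpp
  else
    echo "unknown"
  fi
}

poppler_cpp_version
```

This reports what a source build would see at compile time. From within R, `pdftools::poppler_config()` reports the version the loaded package is linked against, which is the more direct comparison between Windows and Ubuntu.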

mbojan commented 2 years ago

So this turns out to be a serious problem for me. I found rstudio/rstudio#9003, which was frivolously closed as "by design". I guess that for my PDF-disassembling R code to work portably on Linux and Windows, I need to either install a recent poppler on Linux or downgrade/freeze the version on Windows. Do you have any suggestions, @jeroen?

jeroen commented 2 years ago

I have added new backports for you in the cran:poppler PPA. See the instructions here: https://github.com/ropensci/pdftools#installation

This way you should be able to use the same version of poppler as you get in mac/win, without having to build poppler from source.

MohammadAliAmir commented 9 months ago

Hi @jeroen, I have a somewhat similar error, but I get it whenever I try to install pdftools in RStudio with install.packages("pdftools"). The error message from the testing step of the installation:

** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for ‘pdftools’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/amirmo/R/x86_64-pc-linux-gnu-library/4.2/00LOCK-pdftools/00new/pdftools/libs/pdftools.so':
  /home/amirmo/R/x86_64-pc-linux-gnu-library/4.2/00LOCK-pdftools/00new/pdftools/libs/pdftools.so: undefined symbol: _ZNK7poppler7ustring9to_latin1B5cxx11Ev
Error: loading failed
Execution halted

I'm running this in a container with a RHEL 7 base, and I've installed poppler-cpp-devel as you mentioned in the install procedure. I also tried adjusting LD_LIBRARY_PATH to include the directory where pdftools would supposedly be installed, but that had no effect.

R version: R-4.2.2
RStudio version info: RStudio 2022.02.3+492 "Prairie Trillium" Release (1db809b8323ba0a87c148d16eb84efe39a8e7785, 2022-05-20) for CentOS 7
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36

jeroen commented 9 months ago

It may be easier for you to install the precompiled binary for centos-7. Try this:

install.packages("https://p3m.dev/cran/__linux__/centos7/latest/src/contrib/pdftools_3.4.0.tar.gz?r_version=4.2&arch=x86_64", repos = NULL)

MohammadAliAmir commented 9 months ago

I first tried directly from RStudio and got some SSL errors:

> install.packages("https://p3m.dev/cran/__linux__/centos7/latest/src/contrib/pdftools_3.4.0.tar.gz?r_version=4.2&arch=x86_64", repos = NULL)
Installing package into ‘/home/amirmo/R/x86_64-pc-linux-gnu-library/4.2’
(as ‘lib’ is unspecified)
trying URL 'https://p3m.dev/cran/__linux__/centos7/latest/src/contrib/pdftools_3.4.0.tar.gz?r_version=4.2&arch=x86_64'
Warning in install.packages :
  URL 'https://p3m.dev/cran/__linux__/centos7/latest/src/contrib/pdftools_3.4.0.tar.gz?r_version=4.2&arch=x86_64': status was 'SSL connect error'
Error in download.file(p, destfile, method, mode = "wb", ...) :
  cannot open URL 'https://p3m.dev/cran/__linux__/centos7/latest/src/contrib/pdftools_3.4.0.tar.gz?r_version=4.2&arch=x86_64'

So I downloaded the tarball manually and tried to install from it. That does not seem to work either:

> install.packages("pdftools_3.4.0.tar.gz")
Installing package into ‘/home/amirmo/R/x86_64-pc-linux-gnu-library/4.2’
(as ‘lib’ is unspecified)
Warning in install.packages :
  package ‘pdftools_3.4.0.tar.gz’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

I'm not sure whether fixing those SSL errors would help, given that installing from the tarball didn't work either.

jeroen commented 9 months ago

You have to add repos=NULL to install.packages if you are installing from a file.
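
As an illustration, here is a minimal sketch of a terminal wrapper around that call, assuming `Rscript` is on the PATH and the tarball has already been downloaded. The function name `install_local` is hypothetical.

```shell
#!/bin/sh
# Install an R package from a local file. Passing repos = NULL makes
# install.packages() treat its argument as a file path instead of
# a package name to look up in a repository.
install_local() {
  pkg_file=$1
  if ! command -v Rscript >/dev/null 2>&1; then
    echo "Rscript not found" >&2
    return 127
  fi
  if [ ! -f "$pkg_file" ]; then
    echo "no such file: $pkg_file" >&2
    return 1
  fi
  Rscript -e 'args <- commandArgs(trailingOnly = TRUE); install.packages(args[1], repos = NULL)' "$pkg_file"
}

# Usage: install_local pdftools_3.4.0.tar.gz
```

Inside an R session, the equivalent call is `install.packages("pdftools_3.4.0.tar.gz", repos = NULL)`. From a terminal, `R CMD INSTALL pdftools_3.4.0.tar.gz` is another option.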

MohammadAliAmir commented 9 months ago

Oh yes, you are correct, my mistake! Installation went fine btw.