ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools

Compatibility pdftools + R 4.2.1/4.2.2 #120

Open susannabolz opened 1 year ago

susannabolz commented 1 year ago

I have a problem using pdftools::pdf_text() with some PDFs under R 4.2.x. For most PDFs everything works, but for some a fatal error occurs. I'm using RStudio 2022.12.0+353 "Elsbeth Geranium". I tried to work out whether the PDFs that trigger the fatal error share a particular characteristic, but did not find one. In case it is related to the PDF characteristics, I have attached the respective files. The same problem occurs with R 4.2.1. With an older version of R (I tried 4.1.3), everything works as expected.

sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8
[3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
[5] LC_TIME=German_Germany.utf8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] dplyr_1.0.10 data.table_1.14.6 stringr_1.5.0 pdftools_3.3.2

loaded via a namespace (and not attached):
[1] Rcpp_1.0.9 rstudioapi_0.14 magrittr_2.0.3 tidyselect_1.2.0 timechange_0.2.0
[6] R6_2.5.1 rlang_1.0.6 fansi_1.0.3 tools_4.2.2 utf8_1.2.2
[11] cli_3.6.0 DBI_1.1.3 askpass_1.1 assertthat_0.2.1 tibble_3.1.8
[16] lifecycle_1.0.3 qpdf_1.3.0 vctrs_0.5.1 glue_1.6.2 stringi_1.7.12
[21] compiler_4.2.2 pillar_1.8.1 generics_0.1.3 lubridate_1.9.0 pkgconfig_2.0.3

PDF_no error.pdf PDF_fatal error.pdf
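
One way to narrow down which files trigger the crash without losing the R session each time is to run the extraction in a separate process. The following is a minimal sketch and not part of pdftools: `runs_cleanly()` and the `pdfs` folder are hypothetical names, and it assumes the `Rscript` of the running R installation can load pdftools. A non-zero exit status is treated as a failure, which covers ordinary R errors as well as a crash of the child process.

```r
# Hypothetical helper: call `fun` on one file in a fresh Rscript process and
# report whether that process exited normally (exit status 0).
runs_cleanly <- function(path, fun = "pdftools::pdf_text") {
  rscript <- file.path(R.home("bin"), "Rscript")
  # The file path is passed as a trailing argument, so it never has to be
  # pasted into the R expression itself.
  expr <- sprintf("invisible(%s(commandArgs(trailingOnly = TRUE)[1]))", fun)
  status <- system2(rscript, c("-e", shQuote(expr), shQuote(path)),
                    stdout = FALSE, stderr = FALSE)
  isTRUE(status == 0L)
}

# "pdfs" is an assumed folder holding the documents to check.
pdfs <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)
ok <- vapply(pdfs, runs_cleanly, logical(1))
names(ok)[!ok]  # files for which the child process failed
```

Running the same check under two R versions would show whether the set of failing files differs between them.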

ugfmoritz commented 1 year ago

I have the same problem!

It must be some kind of protection that pdftools cannot handle. I identified one PDF with this problem and tried to fix it programmatically with pikepdf (Python). That had some effect, in that the file is no longer protected, but pdftools still throws a fatal error when I open it. I also unprotected the file with the ilovepdf online tool, and that version works. So I guess it is some protection mechanism.

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8 LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C LC_TIME=German_Germany.utf8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] janitor_2.1.0 installr_0.23.4 forcats_0.5.2 stringr_1.5.0 dplyr_1.0.10 purrr_1.0.0 readr_2.1.3 tidyr_1.2.1 tibble_3.1.8 ggplot2_3.4.0 tidyverse_1.3.2
[12] parsedate_1.3.1 data.table_1.14.6 lubridate_1.9.0 timechange_0.1.1 pdftools_3.3.2

loaded via a namespace (and not attached):
[1] qpdf_1.3.0 tidyselect_1.2.0 haven_2.5.1 gargle_1.2.1 snakecase_0.11.0 colorspace_2.0-3 vctrs_0.5.1 generics_0.1.3 utf8_1.2.2 rlang_1.0.6
[11] pillar_1.8.1 glue_1.6.2 withr_2.5.0 DBI_1.1.3 dbplyr_2.2.1 modelr_0.1.10 readxl_1.4.1 lifecycle_1.0.3 munsell_0.5.0 gtable_0.3.1
[21] cellranger_1.1.0 rvest_1.0.3 tzdb_0.3.0 fansi_1.0.3 broom_1.0.2 Rcpp_1.0.9 scales_1.2.1 backports_1.4.1 googlesheets4_1.0.1 jsonlite_1.8.4
[31] fs_1.5.2 askpass_1.1 hms_1.1.2 stringi_1.7.8 grid_4.2.2 cli_3.5.0 tools_4.2.2 magrittr_2.0.3 crayon_1.5.2 pkgconfig_2.0.3
[41] ellipsis_0.3.2 xml2_1.3.3 reprex_2.0.2 googledrive_2.0.0 assertthat_0.2.1 httr_1.4.4 rstudioapi_0.14 R6_2.5.1 compiler_4.2.2

--

Attached are all three documents:
DE000A0WMPJ6_Q1_2015_unlocked_works.pdf
DE000A0WMPJ6_Q1_2015_original.pdf
DE000A0WMPJ6_Q1_2015_unlocked_doesnt_work.pdf
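
To compare the three files with respect to the suspected protection, the flags reported by `pdf_info()` can be printed side by side. This is only a sketch: it assumes `pdf_info()` returns `encrypted` and `locked` fields for these files and does not itself crash on them, so it may be safer to try it in a fresh R session. `protection_flags()` is a hypothetical helper, and the file names are the attachments listed above.

```r
library(pdftools)

# Hypothetical helper: collect the protection-related fields from pdf_info().
protection_flags <- function(path) {
  info <- pdf_info(path)
  c(encrypted = isTRUE(info$encrypted), locked = isTRUE(info$locked))
}

files <- c(
  "DE000A0WMPJ6_Q1_2015_original.pdf",
  "DE000A0WMPJ6_Q1_2015_unlocked_works.pdf",
  "DE000A0WMPJ6_Q1_2015_unlocked_doesnt_work.pdf"
)
# Only query files that are present in the working directory.
files <- files[file.exists(files)]
flags <- vapply(files, protection_flags, c(encrypted = NA, locked = NA))
t(flags)  # one row per file, columns "encrypted" and "locked"
```

If the two unlocked files report the same flags, the difference between them presumably lies elsewhere in the file structure rather than in the protection itself.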