ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools

Suggestion to note changes to pdf_text() processing in poppler version 20.12.1 #92

Open hrbrmstr opened 3 years ago

hrbrmstr commented 3 years ago

I've been doing some back-and-forth testing between R 4.0.x and R 4.1.0 on macOS (both chipsets) of pretty much every package I use, and so far most things work perfectly.

The 4.1.0 CRAN macOS binary for {pdftools} reports `Using poppler version 20.12.1`, whereas the 4.0.x CRAN macOS binary for {pdftools} reports `Using poppler version 0.73.0`. Both are versioned pdftools_2.3.1.
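Since the package version is the same in both cases, the poppler version has to be checked separately. A minimal sketch, assuming `pdftools::poppler_config()` exposes the linked poppler version as `$version`; the helper name `poppler_differs()` is hypothetical and not part of {pdftools}:

```r
# Hypothetical helper (not part of pdftools): TRUE when the poppler version
# recorded for an earlier run differs from the one in use now.
# Assumption: pdftools::poppler_config()$version is a version string such as
# "20.12.1". The default is only evaluated when `current` is not supplied.
poppler_differs <- function(recorded,
                            current = pdftools::poppler_config()$version) {
  numeric_version(recorded) != numeric_version(current)
}

# Compare the two versions mentioned above without needing pdftools loaded.
changed <- poppler_differs("0.73.0", "20.12.1")
```

A script that depends on the exact layout of `pdf_text()` output could call `poppler_differs("0.73.0")` at startup and warn when the result is `TRUE`.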

R 4.0.4 `sessionInfo()`
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] stringi_1.4.6  pdftools_2.3.1

loaded via a namespace (and not attached):
 [1] compiler_4.0.4  magrittr_1.5    ellipsis_0.3.1  tools_4.0.4
 [5] pillar_1.4.6    tibble_3.0.3    crayon_1.3.4    Rcpp_1.0.5
 [9] vctrs_0.3.4     qpdf_1.1        lifecycle_0.2.0 pkgconfig_2.0.3
[13] rlang_0.4.7     askpass_1.1
R 4.1.0 `sessionInfo()`
R Under development (unstable) (2021-03-29 r80130)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] stringi_1.5.3  pdftools_2.3.1

loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0    Rcpp_1.0.6     qpdf_1.1       askpass_1.1

Example code run in both sessions:

tf <- tempfile(fileext = ".pdf")
download.file("https://rud.is/dl/unit42-ransomware-threat-report-2021.pdf", tf)

library(stringi)
library(pdftools)

l <- pdf_text(tf)

stri_split_lines(l[[7]])[[1]]

# see output in the two details blocks below

unlink(tf)
R 4.0.4 example output
stri_split_lines(l[[7]])[[1]]
 [1] "       100+   20–40     10–20     1–10"
 [2] "                 Number of victim organizations with data published on leak sites by country"
 [3] "United States            151    Belgium             4    Chile               1     Pakistan               1"
 [4] "Canada                   39     Sweden              4    Colombia            1     Peru                   1"
 [5] "Germany                  26     South Africa        3    Croatia             1     Poland                 1"
 [6] "United Kingdom           17     Spain               3    Greece              1     Portugal               1"
 [7] "France                   16     Japan               2    Hong Kong           1     Saudi Arabia           1"
 [8] "India                    11     Mexico              2    Jamaica             1     Singapore              1"
 [9] "Australia                7      New Zealand         2    Kenya               1     Sri Lanka              1"
[10] "Brazil                   5      South Korea         2    Luxembourg          1     Taiwan                 1"
[11] "Israel                   5      Switzerland         2    Malaysia            1     Thailand               1"
[12] "Italy                    5      Austria             1    Norway              1     United Arab Emirates   1"
[13] "                      Figure 3: Numbers of victim organizations with data"
[14] "                    published on leak sites by country, Jan. 2020 – Jan. 2021"
[15] "                    Pa l o A l to N et wo r ks | U n i t 4 2 | R a n s o mwa re T h re at R e p o r t, 2 02 1 7"
[16] ""
R 4.1.0 example output
stri_split_lines(l[[7]])[[1]]
 [1] "         100+   20–40       10–20     1–10"
 [2] ""
 [3] ""
 [4] ""
 [5] "                  Number of victim organizations with data published on leak sites by country"
 [6] ""
 [7] "United States               151     Belgium            4    Chile              1    Pakistan                 1"
 [8] ""
 [9] "Canada                      39      Sweden             4    Colombia           1    Peru                     1"
[10] ""
[11] "Germany                     26      South Africa       3    Croatia            1    Poland                   1"
[12] ""
[13] "United Kingdom              17      Spain              3    Greece             1    Portugal                 1"
[14] ""
[15] "France                      16      Japan              2    Hong Kong          1    Saudi Arabia             1"
[16] ""
[17] "India                       11      Mexico             2    Jamaica            1    Singapore                1"
[18] ""
[19] "Australia                   7       New Zealand        2    Kenya              1    Sri Lanka                1"
[20] ""
[21] "Brazil                      5       South Korea        2    Luxembourg         1    Taiwan                   1"
[22] ""
[23] "Israel                      5       Switzerland        2    Malaysia           1    Thailand                 1"
[24] ""
[25] "Italy                       5       Austria            1    Norway             1    United Arab Emirates     1"
[26] ""
[27] ""
[28] ""
[29] "                          Figure 3: Numbers of victim organizations with data"
[30] "                        published on leak sites by country, Jan. 2020 – Jan. 2021"
[31] ""
[32] ""
[33] ""
[34] ""
[35] "                        Pa l o A l to N et wo r ks | U n i t 4 2 | R a n s o mwa re T h re at R e p o r t, 2 02 1   7"
[36] ""
[37] ""

This is very likely a behavior change in the underlying poppler library, but it is definitely going to break at least some automation folks might have set up, so I'm posting the issue as more of a "heads up" and a "may want to note this when 4.1.0 is live". I didn't see anything specific to these "additional newlines" in any of the poppler changelogs.

One thing you'll note if you run the example code is the generation of (IIRC) 19 `PDF error: Invalid Font Weight` messages, but I don't think that's causing this issue.
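For automation that only breaks on the extra empty lines, one possible stopgap is to drop blank lines after splitting each page. This is a sketch in base R, not a fix in {pdftools}; `drop_blank_lines()` is a hypothetical helper, and the two input strings are made-up stand-ins for the old and new output styles. It does not address the wider column spacing that is also visible in the R 4.1.0 output.

```r
# Hypothetical helper (not part of pdftools): split one page of pdf_text()
# output into lines and drop lines that are empty or whitespace-only.
drop_blank_lines <- function(page_text) {
  lines <- strsplit(page_text, "\r?\n")[[1]]
  lines[nzchar(trimws(lines))]
}

# Made-up inputs imitating the two styles: the same rows, with and without
# the additional newlines.
old_style <- "Canada   39\nGermany  26\n"
new_style <- "\n\nCanada   39\n\nGermany  26\n\n\n"

same_rows <- identical(drop_blank_lines(old_style),
                       drop_blank_lines(new_style))
```

With the example from this issue, the equivalent call would be `drop_blank_lines(l[[7]])`.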

jeroen commented 3 years ago

Hmm, I am also seeing output changes on Windows with recent versions of poppler. This is very annoying :/

jeroen commented 3 years ago

I have bisected the issue and reported upstream: https://gitlab.freedesktop.org/poppler/poppler/-/issues/1076