I've been doing some back-and-forth testing between R 4.0.x and R 4.1.0 on macOS (both chipsets) of pretty much every package I use, and so far most things work perfectly.
The 4.1.0 CRAN macOS binary for {pdftools} reports `Using poppler version 20.12.1`, whereas the 4.0.x CRAN macOS binary for {pdftools} reports `Using poppler version 0.73.0`. Both are versioned pdftools_2.3.1.
R 4.0.4 `sessionInfo()`
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringi_1.4.6 pdftools_2.3.1
loaded via a namespace (and not attached):
[1] compiler_4.0.4 magrittr_1.5 ellipsis_0.3.1 tools_4.0.4
[5] pillar_1.4.6 tibble_3.0.3 crayon_1.3.4 Rcpp_1.0.5
[9] vctrs_0.3.4 qpdf_1.1 lifecycle_0.2.0 pkgconfig_2.0.3
[13] rlang_0.4.7 askpass_1.1
R 4.1.0 `sessionInfo()`
R Under development (unstable) (2021-03-29 r80130)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringi_1.5.3 pdftools_2.3.1
loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0 Rcpp_1.0.6 qpdf_1.1 askpass_1.1
Example code run in both sessions:
tf <- tempfile(fileext = ".pdf")
download.file("https://rud.is/dl/unit42-ransomware-threat-report-2021.pdf", tf)
library(stringi)
library(pdftools)
l <- pdf_text(tf)
stri_split_lines(l[[7]])[[1]]
# see output in the two details blocks below
unlink(tf)
R 4.0.4 example output
stri_split_lines(l[[7]])[[1]]
[1] " 100+ 20–40 10–20 1–10"
[2] " Number of victim organizations with data published on leak sites by country"
[3] "United States 151 Belgium 4 Chile 1 Pakistan 1"
[4] "Canada 39 Sweden 4 Colombia 1 Peru 1"
[5] "Germany 26 South Africa 3 Croatia 1 Poland 1"
[6] "United Kingdom 17 Spain 3 Greece 1 Portugal 1"
[7] "France 16 Japan 2 Hong Kong 1 Saudi Arabia 1"
[8] "India 11 Mexico 2 Jamaica 1 Singapore 1"
[9] "Australia 7 New Zealand 2 Kenya 1 Sri Lanka 1"
[10] "Brazil 5 South Korea 2 Luxembourg 1 Taiwan 1"
[11] "Israel 5 Switzerland 2 Malaysia 1 Thailand 1"
[12] "Italy 5 Austria 1 Norway 1 United Arab Emirates 1"
[13] " Figure 3: Numbers of victim organizations with data"
[14] " published on leak sites by country, Jan. 2020 – Jan. 2021"
[15] " Pa l o A l to N et wo r ks | U n i t 4 2 | R a n s o mwa re T h re at R e p o r t, 2 02 1 7"
[16] ""
R 4.1.0 example output
stri_split_lines(l[[7]])[[1]]
[1] " 100+ 20–40 10–20 1–10"
[2] ""
[3] ""
[4] ""
[5] " Number of victim organizations with data published on leak sites by country"
[6] ""
[7] "United States 151 Belgium 4 Chile 1 Pakistan 1"
[8] ""
[9] "Canada 39 Sweden 4 Colombia 1 Peru 1"
[10] ""
[11] "Germany 26 South Africa 3 Croatia 1 Poland 1"
[12] ""
[13] "United Kingdom 17 Spain 3 Greece 1 Portugal 1"
[14] ""
[15] "France 16 Japan 2 Hong Kong 1 Saudi Arabia 1"
[16] ""
[17] "India 11 Mexico 2 Jamaica 1 Singapore 1"
[18] ""
[19] "Australia 7 New Zealand 2 Kenya 1 Sri Lanka 1"
[20] ""
[21] "Brazil 5 South Korea 2 Luxembourg 1 Taiwan 1"
[22] ""
[23] "Israel 5 Switzerland 2 Malaysia 1 Thailand 1"
[24] ""
[25] "Italy 5 Austria 1 Norway 1 United Arab Emirates 1"
[26] ""
[27] ""
[28] ""
[29] " Figure 3: Numbers of victim organizations with data"
[30] " published on leak sites by country, Jan. 2020 – Jan. 2021"
[31] ""
[32] ""
[33] ""
[34] ""
[35] " Pa l o A l to N et wo r ks | U n i t 4 2 | R a n s o mwa re T h re at R e p o r t, 2 02 1 7"
[36] ""
[37] ""
This is very likely a behavior change in the underlying poppler library, but it is definitely going to break at least some automation folks might have set up, so I'm posting the issue as more of a "heads up" and "may want to note this when 4.1.0 is live". I didn't see anything specific to these "additional newlines" in any of the poppler changelogs.
One thing you'll note if you run the example code is the generation of (IIRC) 19 `PDF error: Invalid Font Weight` messages, but I don't think that's causing this issue.
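For anyone whose automation depends on the old line layout, one possible stopgap is to drop blank lines after splitting, so both poppler versions yield the same vector. This is only a sketch: `page_lines()` is a hypothetical helper (not part of {pdftools} or {stringi}), and the two strings are small made-up stand-ins for the outputs above, not the real PDF text.

```r
# Hypothetical helper: split one page of pdf_text() output into lines and
# drop every line that is empty or contains only whitespace.
page_lines <- function(page_text) {
  lines <- strsplit(page_text, "\r?\n")[[1]]
  lines[grepl("\\S", lines)]
}

# Made-up stand-ins for the two layouts (assumption: the only difference
# between them is the extra blank lines).
old_style <- "United States 151\nCanada 39\nGermany 26\n"
new_style <- "United States 151\n\nCanada 39\n\n\n\nGermany 26\n\n\n"

# Both layouts should collapse to the same three lines.
identical(page_lines(old_style), page_lines(new_style))
```

Note that this also throws away blank lines that were present in the older output (such as the trailing `""`), so it only helps when blank lines carry no meaning for the downstream parsing.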