ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools

Missing font #107

Open dcaud opened 2 years ago

dcaud commented 2 years ago

I have seen a lot of the following type of errors on various PDFs:

PDF error: Unknown font in field's DA string
PDF error: Missing 'Tf' operator in field's DA string

For example, this file, Alberta-tf-operator-error-CAV-2-FORMB.pdf, has text on the buttons on the second page (as viewed in macOS Preview or Adobe Acrobat Pro DC). However, when I convert it to PNG, that text is lost and the missing-font message appears in the R console.

pdftools::pdf_convert("Alberta-tf-operator-error-CAV-2-FORMB.pdf",
                      pages = 2)

This may be a PDF file that doesn't adhere to the PDF spec, but because many PDFs do not, I'd like this to work in some fashion.

Is there any way to get pdftools to render the button text in this example file? Maybe that would point to how this can be generalized to other PDFs with similar issues.

jeroen commented 2 years ago

Hmm, I'm not sure. I don't think the buttons contain any text; I think they contain a small image. If we extract the text, it does not appear either:

cat(pdftools::pdf_text('Alberta-tf-operator-error-CAV-2-FORMB.pdf')[2])

But I am also not sure why the image does not appear in the output.
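
The check above can be made a little more explicit. The following is only a sketch: `label_in_page` is a hypothetical helper, and "Print" is a placeholder label, since the thread does not say what the button text actually is. It assumes the PDF from this issue is in the working directory and skips the pdftools call otherwise.

```r
# Hypothetical helper: does `label` occur in the text extracted from one page?
# `page_text` is a single string, e.g. one element of pdftools::pdf_text().
label_in_page <- function(page_text, label) {
  grepl(label, page_text, fixed = TRUE)
}

pdf_file <- "Alberta-tf-operator-error-CAV-2-FORMB.pdf"  # file from this issue

if (requireNamespace("pdftools", quietly = TRUE) && file.exists(pdf_file)) {
  page2 <- pdftools::pdf_text(pdf_file)[2]
  # "Print" is a placeholder label, not taken from the actual form.
  print(label_in_page(page2, "Print"))
}
```

If the label is missing from the extracted text, that is consistent with the buttons being drawn some other way than as page text, but it does not prove what they are.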

jeroen commented 2 years ago

Oh, it actually seems to work with a later version of the poppler library. Maybe I should update it again.

jeroen commented 2 years ago

Which operating system do you use?

dcaud commented 2 years ago

I'm using both Mac and Linux. Here's a profile from the Mac. Thanks for looking into this!

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] tools stats graphics grDevices utils datasets methods base

other attached packages:
[1] shinyBS_0.61 jsonlite_1.7.3 mongolite_2.4.1 ipc_0.1.3
[5] future_1.23.0 promises_1.2.0.1 googleAuthR_2.0.0 firebase_1.0.1
[9] RPostgres_1.4.3 pool_0.1.6 dplyr_1.0.8 shinyjs_2.1.0
[13] pdftools_3.0.1 shinybusy_0.2.2 shinyWidgets_0.6.4 magick_2.7.3
[17] colourpicker_1.1.1 shiny_1.7.1

loaded via a namespace (and not attached):
[1] Rcpp_1.0.8 lubridate_1.8.0 txtq_0.2.4 listenv_0.8.0
[5] assertthat_0.2.1 digest_0.6.29 utf8_1.2.2 parallelly_1.30.0
[9] mime_0.12 R6_2.5.1 backports_1.4.1 httr_1.4.2
[13] pillar_1.7.0 rlang_1.0.1 curl_4.3.2 rstudioapi_0.13
[17] fontawesome_0.2.2 miniUI_0.1.1.1 jquerylib_0.1.4 blob_1.2.2
[21] qpdf_1.1 htmlwidgets_1.5.4 bit_4.0.4 jose_1.2.0
[25] compiler_4.1.2 httpuv_1.6.5 pkgconfig_2.0.3 askpass_1.1
[29] base64enc_0.1-3 globals_0.14.0 htmltools_0.5.2 openssl_1.4.6
[33] tidyselect_1.1.1 tibble_3.1.6 codetools_0.2-18 fansi_1.0.2
[37] crayon_1.4.2 withr_2.4.3 later_1.3.0 xtable_1.8-4
[41] lifecycle_1.0.1 DBI_1.1.2 magrittr_2.0.2 cli_3.1.1
[45] cachem_1.0.6 fs_1.5.2 bslib_0.3.1 filelock_1.0.2
[49] ellipsis_0.3.2 generics_0.1.2 vctrs_0.3.8 bit64_4.0.5
[53] glue_1.6.1 purrr_0.3.4 hms_1.1.1 parallel_4.1.2
[57] fastmap_1.1.0 gargle_1.2.0 base64url_1.4 memoise_2.0.1
[61] sass_0.4.0

jeroen commented 2 years ago

I have released a new version, pdftools 3.1.0, that includes a more recent version of libpoppler for Windows and macOS. You can test it from here:

install.packages("pdftools", repos = "https://ropensci.r-universe.dev")

For Linux it is a bit more tricky because we use the libpoppler that is included with your Linux distribution. I think the problem should be fixed at least in Ubuntu 22.04, which will be released in April, because it includes poppler 22.02: https://packages.ubuntu.com/jammy/libpoppler-dev

I'm not sure about the other distros; it really depends on which OS you use.
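
To see which versions a given machine actually ends up with, something like the sketch below may help. `version_at_least` is a hypothetical helper, and 22.02 is used as the threshold only because it is the poppler version mentioned for Ubuntu 22.04; the thread does not say which poppler release first contains the fix. `pdftools::poppler_config()` reports the poppler version the package was built against.

```r
# Hypothetical helper: TRUE if `version` is at least `minimum`.
version_at_least <- function(version, minimum) {
  package_version(version) >= package_version(minimum)
}

if (requireNamespace("pdftools", quietly = TRUE)) {
  cfg <- pdftools::poppler_config()
  cat("pdftools:", as.character(packageVersion("pdftools")), "\n")
  cat("poppler: ", cfg$version, "\n")
  # Assumed threshold: 22.02 is only the version named for Ubuntu 22.04.
  cat("poppler >= 22.02:", version_at_least(cfg$version, "22.02"), "\n")
}
```

The same lines can be run on any deployment target to check what that environment provides before relying on the fix there.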

krcabrer commented 2 years ago

I have the same issue, but in this case, I cannot update to pdftools version 3.1.0.

** R
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for ‘pdftools’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/kenneth/R/x86_64-pc-linux-gnu-library/4.1/00LOCK-pdftools/00new/pdftools/libs/pdftools.so':
  /home/kenneth/R/x86_64-pc-linux-gnu-library/4.1/00LOCK-pdftools/00new/pdftools/libs/pdftools.so: undefined symbol: _ZNK7poppler8text_box13has_font_infoEv
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/home/kenneth/R/x86_64-pc-linux-gnu-library/4.1/pdftools’
* restoring previous ‘/home/kenneth/R/x86_64-pc-linux-gnu-library/4.1/pdftools’

The downloaded source packages are in ‘/tmp/Rtmp8BrZD7/downloaded_packages’
Warning message:
In install.packages(c("pdftools")) :
  installation of package ‘pdftools’ had non-zero exit status

Any workaround?

This is my platform:

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
[1] LC_CTYPE=es_CO.UTF-8 LC_NUMERIC=C
[3] LC_TIME=es_CO.UTF-8 LC_COLLATE=es_CO.UTF-8
[5] LC_MONETARY=es_CO.UTF-8 LC_MESSAGES=es_CO.UTF-8
[7] LC_PAPER=es_CO.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_CO.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_4.1.2

Thank you very much for your help.

Kenneth

jeroen commented 2 years ago

@krcabrer it works for me on Ubuntu 20.04. Can you please show the full output of your installation log? You probably have multiple, conflicting versions of poppler installed on your machine.

krcabrer commented 2 years ago

Dear @jeroen: Below is the complete log of the procedure. I also uninstalled and purged the poppler libraries and then installed them again, so only one version should be present, but the issue persists.

* installing *source* package ‘pdftools’ ...
** package ‘pdftools’ successfully unpacked and MD5 sums checked
** using staged installation
Found pkg-config cflags and libs!
Using PKG_CFLAGS=-I/usr/local/include/poppler/cpp -I/usr/local/include/poppler
Using PKG_LIBS=-L/usr/local/lib -lpoppler-cpp
** libs
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/usr/local/include/poppler/cpp -I/usr/local/include/poppler -I'/home/kenneth/R/x86_64-pc-linux-gnu-library/4.1/Rcpp/include' -fvisibility=hidden -fpic -g -O2 -fdebug-prefix-map=/build/r-base-i2PIHO/r-base-4.1.2=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c RcppExports.cpp -o RcppExports.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I/usr/local/include/poppler/cpp -I/usr/local/include/poppler -I'/home/kenneth/R/x86_64-pc-linux-gnu-library/4.1/Rcpp/include' -fvisibility=hidden -fpic -g -O2 -fdebug-prefix-map=/build/r-base-i2PIHO/r-base-4.1.2=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c bindings.cpp -o bindings.o
g++ -std=gnu++11 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o pdftools.so RcppExports.o bindings.o -L/usr/local/lib -lpoppler-cpp -L/usr/lib/R/lib -lR
installing to /home/kenneth/R/x86_64-pc-linux-gnu-library/4.1/00LOCK-pdftools/00new/pdftools/libs
** R
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for ‘pdftools’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/kenneth/R/x86_64-pc-linux-gnu-library/4.1/00LOCK-pdftools/00new/pdftools/libs/pdftools.so':
  /home/kenneth/R/x86_64-pc-linux-gnu-library/4.1/00LOCK-pdftools/00new/pdftools/libs/pdftools.so: undefined symbol: _ZNK7poppler8text_box13has_font_infoEv
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/home/kenneth/R/x86_64-pc-linux-gnu-library/4.1/pdftools’
* restoring previous ‘/home/kenneth/R/x86_64-pc-linux-gnu-library/4.1/pdftools’
Warning in install.packages :
  installation of package ‘pdftools’ had non-zero exit status

Thank you for your help.

Kenneth
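
The log above shows the build picking up poppler headers and libraries from /usr/local rather than from the system package paths. One possible reading, not confirmed in this thread, is that a separately installed copy of poppler under /usr/local is shadowing the distribution's package. A minimal sketch for checking this follows; `local_paths` is a hypothetical helper, and the `pkg-config` queries assume that tool is on the PATH.

```r
# Hypothetical helper: pull out -I/-L flags that point under /usr/local.
local_paths <- function(flags) {
  parts <- strsplit(flags, "[[:space:]]+")[[1]]
  parts[grepl("^-[IL]/usr/local/", parts)]
}

# Flags copied from the installation log above.
log_cflags <- "-I/usr/local/include/poppler/cpp -I/usr/local/include/poppler"
print(local_paths(log_cflags))

# Ask pkg-config which poppler-cpp it resolves, if the tool is available.
if (nzchar(Sys.which("pkg-config"))) {
  print(system2("pkg-config", c("--modversion", "poppler-cpp"), stdout = TRUE))
  print(system2("pkg-config", c("--cflags", "poppler-cpp"), stdout = TRUE))
}
```

If the reported flags point under /usr/local while the distribution's libpoppler lives under /usr, the package may be compiled against one copy and loaded against another, which would be one way to end up with an undefined symbol at load time.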

krcabrer commented 2 years ago

Dear @jeroen, I found the solution. I used this PPA repository for poppler:

sudo add-apt-repository ppa:bzamecnik/poppler

Then I updated, and now the package compilation works fine.

It seems the problem was the default poppler version that was installed on the system.

Greetings from Medellín, Colombia, South America.

Kenneth

dcaud commented 2 years ago

Thanks for releasing pdftools 3.1.0, which seems likely to fix the issue I posted on Mac and Windows.

However, I'd like to use this on Linux. Waiting until April and then upgrading to the newer version of Linux will be quite difficult for me. I'm several Linux distro releases behind 22.

If that's the way to go, I'll try when that happens. If there is any way to make this fix in pdftools not depend on the Linux version, that'd be great, but ultimately this isn't a dealbreaker for me. Thanks!

jeroen commented 2 years ago

We could create a PPA with a newer version of poppler. What distro are you using?

dcaud commented 2 years ago

Updated.

Hi Jeroen. Thanks for looking into this. I'm using this distro:

Distributor ID: Debian
Description: Debian GNU/Linux 11 (bullseye)
Release: 11
Codename: bullseye

I imagine that a PPA isn't really a long-term solution. If I wait until April, should the fix you suggested earlier work?

dcaud commented 2 years ago

Hello again. I updated pdftools on Mac and the PDF mentioned in the first post of this thread now renders as expected on my Mac.

However, it doesn't render as expected on shinyapps.io. Any idea how to make it work there? @jeroen mentioned above that updating poppler may be tricky on Ubuntu (which I think is what shinyapps.io uses).