Closed sethmund closed 6 years ago
In your case, the required poppler version is higher than the one you have installed (0.63 > 0.61.0).
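For anyone who wants to check this on their own machine, here is a minimal sketch of the comparison. The helper name poppler_is_at_least is hypothetical (it is not part of pdftools); it only wraps base R's numeric_version(), and the 0.63 threshold is the one quoted in the error message.

```r
# Hypothetical helper (not part of pdftools): compare version strings
# component by component using base R's numeric_version().
poppler_is_at_least <- function(installed, required = "0.63") {
  numeric_version(installed) >= numeric_version(required)
}

poppler_is_at_least("0.61.0")  # the version from the original report
poppler_is_at_least("0.69.0")  # a newer version

# With pdftools installed, the installed version comes from poppler_config():
if (requireNamespace("pdftools", quietly = TRUE)) {
  poppler_is_at_least(pdftools::poppler_config()$version)
}
```

Note that a passing check only tells you the library is new enough; it says nothing about whether the feature was compiled in.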
In any case, even if you have the correct version, the preprocessor check in these lines always results in POPPLER_HAS_PAGE_TEXT_LIST not being defined, so the lines between #if and #else in poppler_pdf_data will never get compiled, and it will always return the message you see (even if you have the correct version, >= 0.63).
Maybe the author intends this function to not be currently available, regardless of your poppler version, or maybe it got commented out at some point and never reverted.
If you really need to use pdftools::pdf_data, clone this repo, remove the 0 && from the preprocessor check, and install locally from source (i.e. update the documentation and run devtools::install() within the pdftools package project).
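The rebuild step might look roughly like the sketch below. It assumes the repo has already been cloned and the source edited by hand; the function name and the default directory "pdftools" are hypothetical, and it relies on devtools::document() and devtools::install() accepting a package path.

```r
# Hypothetical wrapper: rebuild and install a locally patched pdftools clone.
# `pkg_dir` is an assumed path to the cloned repository.
reinstall_patched_pdftools <- function(pkg_dir = "pdftools") {
  stopifnot(dir.exists(pkg_dir))  # fail early if the clone is not there
  devtools::document(pkg_dir)     # update the documentation
  devtools::install(pkg_dir)      # build and install from local source
}

# Usage, after editing the source in the clone:
# reinstall_patched_pdftools("path/to/your/clone")
```

Restart the R session afterwards so the freshly installed build is the one that gets loaded.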
This worked for me, and I get very interesting data about each rendered object:
# Show my poppler version
pdftools::poppler_config()
$version
[1] "0.69.0"
$can_render
[1] TRUE
$supported_image_formats
[1] "png" "jpeg" "jpg" "tiff" "pnm"
# Read a pdf file you are likely to have
head(
  pdftools::pdf_data(
    list.files(
      system.file(package = "utils"),
      pattern = ".pdf",
      full.names = TRUE,
      recursive = TRUE
    )
  )[[1]], # Only output the first page
  n = 7
)
       text width height   x   y space
1    Sweave    49     17 234 139  TRUE
2      User    31     17 289 139  TRUE
3    Manual    51     17 325 139 FALSE
4 Friedrich    45     11 235 172  TRUE
5    Leisch    31     11 284 172  TRUE
6       and    18     11 320 172  TRUE
7    R-core    33     11 343 172 FALSE
Yep. Looks like @jeroen set this in 369e3fb to not be included, coming from a previous commit stating it was "experimental support for pdf_data()".
Works for me, but I don't know if it would cause issues on other setups.
@jeroen, if I add tests for this, do you think it could be included in a future release?
Thanks, it doesn't exactly return anything useful for me, and I eventually went the PyPDF2 route to extract the form data since pdftools simply required way too much custom parsing.
The problem is that this feature is still not working properly in libpoppler. Unfortunately the poppler maintainer is not really responsive in fixing it.
Any chance for a field name and value extraction function maybe using Java or C++ in the near future? Would be very useful for public institutions that use fillable PDFs.
Just to confirm that the message mentioned by @odeleongt exists:
library(pdftools)
pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
data = pdf_data(pdf_file)
#> Error in poppler_pdf_data(loadfile(pdf), opw, upw): This feature requires poppler >= 0.63. You have 0.67.0
Created on 2018-11-22 by the reprex package (v0.2.1)
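Until a fixed build is available, a script that calls pdf_data() can guard against this error. This is only a sketch: safe_pdf_data is a hypothetical name, and returning NULL on failure is an arbitrary choice made for illustration.

```r
# Hypothetical wrapper: return NULL instead of stopping when pdf_data()
# fails (for example with the "requires poppler >= 0.63" error above).
safe_pdf_data <- function(pdf_file) {
  tryCatch(
    pdftools::pdf_data(pdf_file),
    error = function(e) {
      message("pdf_data() failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Usage:
# words <- safe_pdf_data(file.path(R.home("doc"), "NEWS.pdf"))
# if (is.null(words)) { ... fall back to pdftools::pdf_text() ... }
```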
This was finally fixed in libpoppler, and will be in the next version of pdftools.
This is fixed in pdftools 2.0, which is now on CRAN.
Can't use the pdf_data() function due to version error: