ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools

Getting a poppler version error when trying to use pdf_data() #44

Closed sethmund closed 6 years ago

sethmund commented 6 years ago

Can't use the pdf_data() function due to a version error:

Error in poppler_pdf_data(loadfile(pdf), opw, upw) : 
  This feature requires poppler >= 0.63. You have 0.61.0

Session info ---------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.0 (2018-04-23)
 system   i386, mingw32               
 ui       RStudio (1.1.453)           
 language (EN)                        
 collate  English_United States.1252  
 tz       America/New_York            
 date     2018-08-13                  

Packages -------------------------------------------------------------------------------------------
 package   * version date       source                            
 base      * 3.5.0   2018-04-23 local                             
 compiler    3.5.0   2018-04-23 local                             
 datasets  * 3.5.0   2018-04-23 local                             
 devtools  * 1.13.6  2018-06-27 CRAN (R 3.5.1)                    
 digest      0.6.15  2018-01-28 CRAN (R 3.5.1)                    
 graphics  * 3.5.0   2018-04-23 local                             
 grDevices * 3.5.0   2018-04-23 local                             
 memoise     1.1.0   2017-04-21 CRAN (R 3.5.1)                    
 methods   * 3.5.0   2018-04-23 local                             
 pdftools  * 1.8     2018-08-13 Github (ropensci/pdftools@30f1f4c)
 Rcpp        0.12.18 2018-07-23 CRAN (R 3.5.1)                    
 stats     * 3.5.0   2018-04-23 local                             
 tools       3.5.0   2018-04-23 local                             
 utils     * 3.5.0   2018-04-23 local                             
 withr       2.1.2   2018-03-15 CRAN (R 3.5.1)                    
 yaml        2.2.0   2018-07-25 CRAN (R 3.5.1)  

odeleongt commented 6 years ago

In your case, the required poppler version is higher than the one you have installed (0.63 > 0.61.0).
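
The comparison behind that error message can be reproduced in base R. This is only a minimal sketch, not the check pdftools itself performs (that happens in C++ at compile time), and `meets_minimum` is a hypothetical helper name:

```r
# Hypothetical helper: does an installed version satisfy a minimum version?
# numeric_version() compares component by component, so "0.9" < "0.63".
meets_minimum <- function(installed, minimum) {
  numeric_version(installed) >= numeric_version(minimum)
}

meets_minimum("0.61.0", "0.63")  # the version reported in this issue
meets_minimum("0.69.0", "0.63")  # a newer poppler
```

With pdftools installed, `pdftools::poppler_config()$version` gives the string to pass as `installed`.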

Even with the correct version, though, it would still fail, because the preprocessor check in these lines always leaves POPPLER_HAS_PAGE_TEXT_LIST undefined:

https://github.com/ropensci/pdftools/blob/30f1f4c19fd6443cdeb7e83a3184b33a74cfd27f/src/bindings.cpp#L15-L17

As a result, the lines between #if and #else in poppler_pdf_data never get compiled, and the function always returns the message you see (even if you have the correct version, >= 0.63):

https://github.com/ropensci/pdftools/blob/30f1f4c19fd6443cdeb7e83a3184b33a74cfd27f/src/bindings.cpp#L168-L204

odeleongt commented 6 years ago

Maybe the author intends this function to be unavailable for now, regardless of your poppler version, or maybe it got disabled at some point and was never re-enabled.

If you really need to use pdftools::pdf_data... clone this repo, remove the 0 && in

https://github.com/ropensci/pdftools/blob/30f1f4c19fd6443cdeb7e83a3184b33a74cfd27f/src/bindings.cpp#L15

and install locally from source (i.e. update the documentation and run devtools::install() within the pdftools package project).
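
The edit itself can be scripted. The sketch below assumes the guard line has the form `#if 0 && <condition>`, which is inferred from the description above rather than copied from bindings.cpp. `enable_guard` is a hypothetical helper, and the sample lines are made up for illustration:

```r
# Hypothetical helper: drop a leading "0 &&" from preprocessor #if lines,
# leaving every other line untouched.
enable_guard <- function(lines) {
  sub("^(\\s*#if\\s+)0\\s*&&\\s*", "\\1", lines)
}

# Illustrative input only; the real condition in src/bindings.cpp may differ.
example <- c("#if 0 && SOME_VERSION_CHECK", "#define SOME_FEATURE", "#endif")
enable_guard(example)
```

On a local clone this would amount to `writeLines(enable_guard(readLines("src/bindings.cpp")), "src/bindings.cpp")`, followed by `devtools::install()` from the package root.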

This worked for me, and I get very interesting data about each rendered object:

# Show my poppler version
pdftools::poppler_config()
$version
[1] "0.69.0"

$can_render
[1] TRUE

$supported_image_formats
[1] "png"  "jpeg" "jpg"  "tiff" "pnm" 
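
# Read a PDF file you are likely to have (the first PDF shipped with utils)
pdf_file <- list.files(
  system.file(package = "utils"),
  pattern = "\\.pdf$", # escape the dot so only the .pdf extension matches
  full.names = TRUE,
  recursive = TRUE
)[1]
head(
  pdftools::pdf_data(pdf_file)[[1]], # Only output the first page
  n = 7
)
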
       text width height   x   y space
1    Sweave    49     17 234 139  TRUE
2      User    31     17 289 139  TRUE
3    Manual    51     17 325 139 FALSE
4 Friedrich    45     11 235 172  TRUE
5    Leisch    31     11 284 172  TRUE
6       and    18     11 320 172  TRUE
7    R-core    33     11 343 172 FALSE
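
Since each row is one word with its bounding box, lines of text can be rebuilt by grouping on `y`. Here is a minimal sketch using the seven rows above, typed in by hand. It assumes words on the same line share an identical `y` value, which holds for this sample but may not hold for every PDF:

```r
# The first seven rows of the output above, entered manually.
words <- data.frame(
  text = c("Sweave", "User", "Manual", "Friedrich", "Leisch", "and", "R-core"),
  width = c(49, 31, 51, 45, 31, 18, 33),
  height = c(17, 17, 17, 11, 11, 11, 11),
  x = c(234, 289, 325, 235, 284, 320, 343),
  y = c(139, 139, 139, 172, 172, 172, 172),
  space = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Order words top-to-bottom, then left-to-right, and paste each line together.
words <- words[order(words$y, words$x), ]
text_lines <- tapply(words$text, words$y, paste, collapse = " ")
unname(text_lines)
```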

odeleongt commented 6 years ago

Yep. Looks like @jeroen set this in 369e3fb to not be included, coming from a previous commit stating it was "experimental support for pdf_data()".

Works for me, but I don't know if it would cause issues on other setups.

@jeroen, if I add tests for this, do you think it could be included in a future release?

sethmund commented 6 years ago

Thanks. It doesn't exactly return anything useful for me, and I eventually went the PyPDF2 route to extract the form data, since pdftools simply required way too much custom parsing.

jeroen commented 6 years ago

The problem is that this feature is still not working properly in libpoppler. Unfortunately, the poppler maintainer is not really responsive about fixing it.

sethmund commented 6 years ago

Any chance of a function for extracting form field names and values, maybe using Java or C++, in the near future? It would be very useful for public institutions that use fillable PDFs.

Nowosad commented 5 years ago

Just to confirm that the message mentioned by @odeleongt appears even with a recent enough poppler:

library(pdftools)
pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
data = pdf_data(pdf_file)
#> Error in poppler_pdf_data(loadfile(pdf), opw, upw): This feature requires poppler >= 0.63. You have 0.67.0

Created on 2018-11-22 by the reprex package (v0.2.1)

jeroen commented 5 years ago

This was finally fixed in libpoppler, and will be in the next version of pdftools.

jeroen commented 5 years ago

This is fixed in pdftools 2.0, which is now on CRAN.
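
One way to check whether an installation already includes the fix is sketched below. It assumes 2.0 is the right threshold, as stated in the comment above, and `has_fixed_pdftools` is a hypothetical helper name:

```r
# Hypothetical helper: TRUE if pdftools is installed and at least `minimum`.
has_fixed_pdftools <- function(minimum = "2.0") {
  requireNamespace("pdftools", quietly = TRUE) &&
    utils::packageVersion("pdftools") >= minimum
}

has_fixed_pdftools()
```

If this returns FALSE, `install.packages("pdftools")` should fetch the current CRAN release.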