ropensci / tabulapdf

Bindings for Tabula PDF Table Extractor Library
https://docs.ropensci.org/tabulapdf/
Apache License 2.0

inconsistent behavior of extract_tables and extract_areas #145

Closed datapumpernickel closed 2 years ago

datapumpernickel commented 2 years ago

First: Thank you very much for this awesome package. It has saved me tremendous headaches in the past!

Now I am seeing some odd behavior that I cannot wrap my head around. When I call extract_areas() and select the table interactively, the result looks fine: I get back a complete table in the usual format. When I call extract_tables() with the exact same area specified, the result is just list(). I do not understand why one returns the table and the other does not. I would appreciate your input!

Thanks in advance.

Put your code here:

## rJava loads successfully
# install.packages("rJava")
library("rJava")
library("tidyverse")

## load package
library("tabulizer")

httr::GET(
  "https://www.bmwi.de/Redaktion/DE/Publikationen/Aussenwirtschaft/ruestungsexportbericht-2019.pdf?__blob=publicationFile",
  httr::write_disk("temp.pdf")
)

tabulizer::extract_areas("temp.pdf",
                         pages = 82) %>%
  as.data.frame()

tabulizer::extract_tables("temp.pdf",
                          pages = 82)

locate_areas("temp.pdf",
             pages = 82)

tabulizer::extract_tables("temp.pdf",
                          pages = 82,
                          area = list(c(169.78232, 32.63903, 735.16167, 551.83787)))

## session info for your system
sessionInfo()

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rJava_1.0-4       rvest_1.0.1       jsonlite_1.7.2    httr_1.4.2        shiny_1.7.0       pdftools_3.0.1    tabulizer_0.2.2  
 [8] SWPcdR_0.0.0.9000 extrafont_0.17    janitor_2.1.0     forcats_0.5.1     stringr_1.4.0     dplyr_1.0.7       purrr_0.3.4      
[15] readr_2.0.1       tidyr_1.1.4       tibble_3.1.4      ggplot2_3.3.5     tidyverse_1.3.1   pacman_0.5.1     

loaded via a namespace (and not attached):
 [1] fs_1.5.0            sf_1.0-2            lubridate_1.7.10    tools_4.1.1         padr_0.6.0          backports_1.2.1    
 [7] bslib_0.3.0         utf8_1.2.2          R6_2.5.1            KernSmooth_2.23-20  DBI_1.1.1           colorspace_2.0-2   
[13] withr_2.4.3         sp_1.4-5            tidyselect_1.1.1    curl_4.3.2          compiler_4.1.1      extrafontdb_1.0    
[19] cli_3.0.1           xml2_1.3.2          sass_0.4.0          scales_1.1.1        classInt_0.4-3      proxy_0.4-26       
[25] askpass_1.1         digest_0.6.27       pkgconfig_2.0.3     htmltools_0.5.2     dbplyr_2.1.1        fastmap_1.1.0      
[31] rlang_0.4.11        readxl_1.3.1        rstudioapi_0.13     jquerylib_0.1.4     generics_0.1.1      magrittr_2.0.1     
[37] Rcpp_1.0.7          munsell_0.5.0       fansi_0.5.0         lifecycle_1.0.1     stringi_1.7.4       snakecase_0.11.0   
[43] grid_4.1.1          promises_1.2.0.1    crayon_1.4.2        miniUI_0.1.1.1      lattice_0.20-44     haven_2.4.3        
[49] hms_1.1.1           pillar_1.6.4        reprex_2.0.1        glue_1.4.2          qpdf_1.1            modelr_0.1.8       
[55] tabulizerjars_1.0.1 selectr_0.4-2       png_0.1-7           vctrs_0.3.8         tzdb_0.1.2          httpuv_1.6.3       
[61] Rttf2pt1_1.3.9      cellranger_1.1.0    gtable_0.3.0        assertthat_0.2.1    cachem_1.0.6        mime_0.12          
[67] xtable_1.8-4        broom_0.7.10        countrycode_1.3.0   e1071_1.7-8         rnaturalearth_0.1.0 later_1.3.0        
[73] class_7.3-19        giscoR_0.2.4        units_0.7-2         writexl_1.4.0       ellipsis_0.3.2   

AlbanSagouis commented 2 years ago

Hi,

This looks like the problem I have. My understanding is that tabulizer::extract_tables() simply does not take the area argument into account. I tried passing coordinates by hand and directly passing the list returned by tabulizer::locate_areas(); the result is always the same as when no area is passed at all, and the whole page is scanned. On top of that, no error or warning is shown, which is confusing.

I would love that problem to be solved.

datapumpernickel commented 2 years ago

Upon further inspection, I have come to the conclusion that the area argument (like the columns argument) is ignored if guess = TRUE, which is the default. This should probably be added to the documentation. If you specify guess = FALSE, it does indeed take the area into account...
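
For anyone landing here later, a minimal sketch of the workaround described above, reusing the file name, page, and coordinates from the original report. The names table_area and extract_args are only illustrative, and the guard around the call is an addition so the snippet does not fail when tabulizer or temp.pdf is missing. I have not re-verified the output on this particular PDF.

```r
# Area in the order used by locate_areas(): c(top, left, bottom, right).
# One vector per requested page, wrapped in a list.
table_area <- list(c(169.78232, 32.63903, 735.16167, 551.83787))

# Arguments for extract_tables(). guess = FALSE is the important part:
# with the default guess = TRUE, `area` (and `columns`) are ignored.
extract_args <- list(
  file  = "temp.pdf",
  pages = 82,
  area  = table_area,
  guess = FALSE
)

# Only run the extraction if the package and the downloaded PDF are available.
if (requireNamespace("tabulizer", quietly = TRUE) && file.exists("temp.pdf")) {
  tables <- do.call(tabulizer::extract_tables, extract_args)
  str(tables)
}
```

If the area is correct, tables should then contain the table from the selected region only rather than whatever is auto-detected on the whole page.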

AlbanSagouis commented 2 years ago

You're right, thanks a lot @datapumpernickel . Would you like to write a PR addressing this? If not, I'll happily do it.

datapumpernickel commented 2 years ago

Thanks @AlbanSagouis, the PR looks great. I do not have much experience with writing packages or testthat, so I am impressed. :) Fixed by adding a warning in #150.