ropensci / tabulapdf

Bindings for Tabula PDF Table Extractor Library
https://docs.ropensci.org/tabulapdf/
Apache License 2.0

Can't get the tables from PDF using "extract_tables" #61

Open zx8754 opened 7 years ago

zx8754 commented 7 years ago

I have the PDF below, which seems to have "clean" tables, but extract_tables() returns an empty list: http://databank.worldbank.org/data/download/GDP.pdf

library(tabulizer) # tabulizer_0.1.24

# read from local PDF file
# myPDF <- extract_tables("GDP.pdf")

# read from link
myPDF <- extract_tables("http://databank.worldbank.org/data/download/GDP.pdf")

length(myPDF)
# [1] 0 

I tried to use extract_areas, which works fine.

Any pointers on why extract_tables wouldn't work? Maybe I am missing some arguments?

> sessionInfo()
R version 3.4.1 Patched (2017-07-04 r72891)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tabulizer_0.1.24

loaded via a namespace (and not attached):
[1] tabulizerjars_0.9.2 compiler_3.4.1      tools_3.4.1         rJava_0.9-8        
[5] png_0.1-7    

scottkosty commented 7 years ago

I can reproduce. I wonder if extract_tables gets confused by the header lines. It would be nice if this worked automatically, since the PDF indeed is pretty clean. My guess is that this is an upstream issue (https://github.com/tabulapdf/tabula-java/) but I'd be happy if I were wrong.

I just wanted to note that you could set the area argument of extract_tables. I know that's not ideal, but better than doing it interactively for all of the pages.
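A minimal sketch of that workaround, in case it helps. The coordinates and page range below are placeholders, not values measured from GDP.pdf, and `make_areas()` and `extract_fixed_area()` are hypothetical helper names introduced only for illustration. The sketch relies on `extract_tables()` accepting `pages`, `area` (a list of `c(top, left, bottom, right)` vectors, one per page) and `guess = FALSE`.

```r
# Build one area vector per requested page.
# tabulizer expects each area as c(top, left, bottom, right), in points.
make_areas <- function(area, pages) {
  stopifnot(is.numeric(area), length(area) == 4)
  rep(list(unname(area)), length(pages))
}

# Hypothetical wrapper: extract the same fixed region from every page,
# with automatic table detection switched off.
extract_fixed_area <- function(file, pages, area) {
  tabulizer::extract_tables(
    file,
    pages = pages,
    area  = make_areas(area, pages),
    guess = FALSE
  )
}

# Placeholder values (assumptions): adjust to the real table position.
gdp_area  <- c(top = 100, left = 40, bottom = 750, right = 570)
gdp_pages <- 1:4
```

With a local copy of the file, `extract_fixed_area("GDP.pdf", gdp_pages, gdp_area)` would then try the same region on each listed page. One way to get real coordinates is to run `locate_areas()` interactively on a single page and reuse the result for the rest.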

leeper commented 7 years ago

There's an update of Tabula that was released last week, which apparently includes a number of fixes and improvements. It is going to require a bit of work to integrate, but I will revisit this once I have the new version working to see if that solves it.

ChetanArvindPatil commented 5 years ago

@leeper @zx8754

Something similar is happening to me. I can read the tables in the PDF file, but only the header values are read, not the table contents.

Any suggestion on how to solve this?

ghost commented 5 years ago

The PDF in question: https://www.qc.cuny.edu/About/Research/Documents/Fact_Book_2014-2015_Final.pdf

My problem is that extract_tables does not extract the table on page 86, while it works normally on the other pages.

Any solution?