ropensci / tabulapdf

Bindings for Tabula PDF Table Extractor Library
https://docs.ropensci.org/tabulapdf/
Apache License 2.0

Warning For Reflective Access #106

Open billdenney opened 5 years ago

billdenney commented 5 years ago

When working with the current versions of R and rJava, extract_tables() prints the following warning:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by RJavaTools to method java.util.ArrayList$Itr.hasNext()
WARNING: Please consider reporting this to the maintainers of RJavaTools
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

Unfortunately, I cannot share the underlying .pdf file that triggered the warning.

fpinter commented 5 years ago

I reproduced using the first code example from the readme.

library("tabulizer")
f <- system.file("examples", "data.pdf", package = "tabulizer")
out1 <- extract_tables(f)

(Mac 10.13, tabulizer 0.2.2, rJava 0.9-11, R 3.6.0, Java 11.0.1)

bedantaguru commented 5 years ago

Getting the same in Linux too

ziembaej commented 5 years ago

Just got the same warning. Using R version 3.6.0 (2019-04-26) on Mac OS 10.14.6

Has this caused any actual problems for others?

bedantaguru commented 5 years ago

In Travis it causes build failure.

antonio1970 commented 5 years ago

Was anyone able to solve it? I got the same error.

MattCowgill commented 4 years ago

Same here

dernapo commented 4 years ago

Same issue here

sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
[1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8
[4] LC_COLLATE=de_DE.UTF-8 LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=de_DE.UTF-8 LC_ADDRESS=de_DE.UTF-8
[10] LC_TELEPHONE=de_DE.UTF-8 LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=de_DE.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] janitor_1.2.0 tabulizer_0.2.2 data.table_1.12.6 tidytext_0.2.0 dplyr_0.8.3
[6] stringr_1.4.0 rvest_0.3.4 xml2_1.2.2 selectr_0.4-1 cronR_0.4.0

lefcgis commented 4 years ago

I have the same problem. I wonder whether it depends on the "quality" of the document. In other words, some PDFs work with Tabulizer and others do not.

For example, if you download this pdf you can use Tabulizer. However, with this one you cannot. I don't know why! I don't think there is anything illegal about the document; I think it comes down to the "quality of the information".

If you write a document in Word or Excel, export it to PDF, and try that, it works! So it seems the Tabulizer algorithm doesn't work on all PDF documents 🧙‍♂️

P.S. I ran this in RStudio 1.2.5033 and R 3.6.3 (2020-02-29)

billdenney commented 4 years ago

@lefcgis, there definitely could be some documents that trigger the issue and some that do not, but it is a Java coding issue and not an issue with a PDF file (as in, the pdf standard is being followed). For more information, see https://stackoverflow.com/questions/50251798/what-is-an-illegal-reflective-access
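
To make the Java side of this concrete, here is a minimal sketch of passing a JVM option through rJava. It assumes a Java 9 or later runtime and assumes that opening java.util to unnamed modules with --add-opens is enough to make the reflective call reported above legal; I have not verified that it silences the warning on every setup. The options(java.parameters = ...) mechanism is rJava's, and it only takes effect if it is set before the JVM starts, i.e. before rJava or tabulizer is loaded in the session.

```r
# Assumption: Java 9+; --add-opens is a standard JVM option, but whether it
# removes this particular warning under rJava is untested here.
# Must run before rJava starts the JVM (so before loading tabulizer).
options(java.parameters = "--add-opens=java.base/java.util=ALL-UNNAMED")

# Only run the readme example when tabulizer is actually installed.
if (requireNamespace("tabulizer", quietly = TRUE)) {
  f <- system.file("examples", "data.pdf", package = "tabulizer")
  out <- tabulizer::extract_tables(f)
}
```

If the JVM is already running in the session, changing java.parameters has no effect until R is restarted.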

lefcgis commented 4 years ago

OK! So it's possible that the cause is the JDK and JRE packages, since they are prerequisites for installing rJava. Thanks for your answer, @billdenney 🧙‍♂️

bedantaguru commented 4 years ago

Now it's breaking my build

cjyetman commented 4 years ago

For me, this warning only occurs the first time the example code is run in a new R session. Subsequent runs do not show this warning. Is that the same behavior others here are seeing?

The test code I've been using is...

out <- tabulizer::extract_tables(system.file("examples", "data.pdf", package = "tabulizer"))

If so, I'm curious if #125 resolves this issue for you.

maahutch commented 3 years ago

Same thing happened to me. Got this error the first time then just an empty list each subsequent run. I can read other pdfs but it fails on one which is a different format.

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tabulizer_0.2.2

loaded via a namespace (and not attached):
[1] tabulizerjars_1.0.1 compiler_4.0.2      tools_4.0.2         rJava_0.9-13       
[5] png_0.1-7

cjyetman commented 3 years ago

@maahutch An error, or a warning? Those are significantly different.
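
For anyone unsure which of the two they are seeing, the following base R sketch shows one way to tell. classify_condition() is a hypothetical helper written for this illustration, not part of tabulizer. As far as I can tell, the "WARNING:" lines quoted at the top of this issue are printed by the JVM itself rather than raised as an R condition, so a call that only produces those lines would be classified as "ok" here.

```r
# Hypothetical helper: report whether evaluating an expression signals an
# R warning, an R error, or neither. Evaluation stops at the first condition.
classify_condition <- function(expr) {
  tryCatch(
    {
      force(expr)  # the argument is evaluated lazily, inside tryCatch
      "ok"
    },
    warning = function(w) "warning",
    error = function(e) "error"
  )
}

classify_condition(1 + 1)            # no condition signalled
classify_condition(as.integer("x"))  # NA coercion signals an R warning
classify_condition(stop("boom"))     # stop() signals an R error
```

A warning lets the call finish and return a value, while an error aborts it, which is why the difference matters when deciding whether the output can be trusted.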

bbolker commented 3 months ago

FWIW I'm getting a WARNING from Java (not R), and an empty list, the first time. Subsequently I get an empty list without a warning from Java.

It's possible that this particular PDF is image-only and has no underlying text anyway .. ?

pg6.pdf

pachadotdev commented 3 months ago

Yes, that would require OCR (e.g., tesseract or paws)
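
A rough way to check for a text layer before reaching for OCR is sketched below. It uses the pdftools package, which is an assumption on my part and is not mentioned elsewhere in this thread; pdftools::pdf_ocr_text() in turn relies on the tesseract package. has_no_text() is a hypothetical helper, and the "pg6.pdf" path stands in for wherever the attached file was saved.

```r
# Hypothetical helper: TRUE when no page string contains a non-whitespace
# character, i.e. there is no text layer for a table extractor to read.
has_no_text <- function(pages) {
  !any(grepl("[^[:space:]]", pages))
}

f <- "pg6.pdf"  # assumed local path to the attached file

if (file.exists(f) && requireNamespace("pdftools", quietly = TRUE)) {
  pages <- pdftools::pdf_text(f)  # one string per page
  if (has_no_text(pages) && requireNamespace("tesseract", quietly = TRUE)) {
    # Image-only PDF: fall back to OCR, which returns plain text, not tables.
    pages <- pdftools::pdf_ocr_text(f)
  }
}
```

OCR output is plain text per page, so any table structure would still need to be rebuilt from it afterwards.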