tabulapdf / tabula-java

Extract tables from PDF files
MIT License

Auto-detect problems. #40

Open cinjon opened 8 years ago

cinjon commented 8 years ago

https://www.dropbox.com/s/4nvptahkr1d1w16/2217.2015.09.04.%E5%B9%B3%E6%88%9028%E5%B9%B4-3-10.pdf?dl=0

In the PDF at the link above:

Notably, when I click the preview, pages 1-3 display the text as just large blobs.

cinjon commented 8 years ago

In the same document, if I run tabula-java.jar -f JSON -g -r -p all ./2217.2015.09.04.平成28年-3-10.pdf, then the output for just the last page is four tables with the following characteristics:

(top, height, left, width, # of rows)
(0.118, 841.6806640625, 0.0, 594.719970703125, 7)
(0.118, 841.6806640625, 0.0, 594.719970703125, 11)
(0.118, 841.6806640625, 0.0, 594.719970703125, 6)
(0.118, 841.6806640625, 0.0, 594.719970703125, 11)

How can this be, if it should yield the same output as the Tabula GUI, which shows four tables that each have different top/height/left/width attributes?

jazzido commented 8 years ago

I just ran a quick test, and confirmed my suspicion. This is the output of java -cp target/tabula-extractor-0.7.4-SNAPSHOT-jar-with-dependencies.jar technology.tabula.debug.Debug -r -p 1 2217.2015.09.04.平成28年-3-10.pdf

[Image: 2217.2015.09.04.平成28年-3-10, page 1, debug rendering]

It shows that there are “ruling” objects, presumably white, that Tabula misinterprets as table boundaries. Furthermore, there are two rectangles on that page, one contained inside the other (see the ruling lines right at the page boundary).
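One way to spot this kind of page-sized rectangle is to compare each detected area against the page box. What follows is a minimal sketch and not tabula-java code: `PageBoundaryCheck`, `coversPage`, the 0.99 threshold, and the 595 x 842 pt page size are assumptions made for illustration. The area values are the ones reported earlier in this thread.

```java
import java.awt.geom.Rectangle2D;

public class PageBoundaryCheck {

    /**
     * Returns true when the candidate area overlaps at least minCoverage
     * (a fraction between 0 and 1) of the page area. Both rectangles are
     * expected in the same units, e.g. PDF points.
     */
    static boolean coversPage(Rectangle2D candidate, Rectangle2D page, double minCoverage) {
        Rectangle2D overlap = candidate.createIntersection(page);
        if (overlap.isEmpty()) {
            return false; // no overlap at all
        }
        double overlapArea = overlap.getWidth() * overlap.getHeight();
        double pageArea = page.getWidth() * page.getHeight();
        return overlapArea / pageArea >= minCoverage;
    }

    public static void main(String[] args) {
        // Assumed page size (roughly A4 in points); the thread does not state it.
        Rectangle2D page = new Rectangle2D.Double(0, 0, 595, 842);

        // Area reported in the thread: left=0.0, top=0.118, width=594.72, height=841.68.
        Rectangle2D detected =
                new Rectangle2D.Double(0.0, 0.118, 594.719970703125, 841.6806640625);

        // A hypothetical smaller rectangle nested inside the detected one.
        Rectangle2D inner = new Rectangle2D.Double(50, 100, 400, 300);

        System.out.println("covers page: " + coversPage(detected, page, 0.99));
        System.out.println("inner is nested: " + detected.contains(inner));
    }
}
```

Areas flagged this way are candidates for being page borders rather than real tables. Whether to drop them or only warn about them is a design choice this sketch leaves open.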

Tabula’s (and tabula-java’s) table detection algorithm is quite rudimentary: if it finds a set of 4 lines that form a rectangle, it’ll detect that as a table boundary. A possible solution to this (when the lines are 'invisible' or the same color as the page background) is described in issue #21.
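To illustrate the idea of ignoring invisible lines, here is a rough sketch that drops rulings whose stroke color is close to the page background before any rectangle detection runs. It is not the approach from issue #21 (which is not reproduced here) and it does not use tabula-java's classes: `VisibleRulingFilter`, `ColoredRuling`, `keepVisible`, and the tolerance value are hypothetical names and parameters, and the sketch assumes the stroke color of each ruling is available.

```java
import java.awt.Color;
import java.awt.geom.Line2D;
import java.util.ArrayList;
import java.util.List;

public class VisibleRulingFilter {

    /** Hypothetical holder: a ruling segment plus the color it was stroked with. */
    static final class ColoredRuling {
        final Line2D segment;
        final Color stroke;

        ColoredRuling(Line2D segment, Color stroke) {
            this.segment = segment;
            this.stroke = stroke;
        }
    }

    /** Sum of absolute per-channel RGB differences; 0 means identical colors. */
    static int colorDistance(Color a, Color b) {
        return Math.abs(a.getRed() - b.getRed())
                + Math.abs(a.getGreen() - b.getGreen())
                + Math.abs(a.getBlue() - b.getBlue());
    }

    /** Keeps only rulings whose stroke differs from the background by more than tolerance. */
    static List<ColoredRuling> keepVisible(List<ColoredRuling> rulings,
                                           Color background, int tolerance) {
        List<ColoredRuling> visible = new ArrayList<>();
        for (ColoredRuling r : rulings) {
            if (colorDistance(r.stroke, background) > tolerance) {
                visible.add(r);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<ColoredRuling> rulings = new ArrayList<>();
        // White line along the top page edge: invisible on a white page.
        rulings.add(new ColoredRuling(new Line2D.Double(0, 0, 595, 0), Color.WHITE));
        // Black line inside the page: a real, visible ruling.
        rulings.add(new ColoredRuling(new Line2D.Double(50, 100, 500, 100), Color.BLACK));

        List<ColoredRuling> visible = keepVisible(rulings, Color.WHITE, 30);
        System.out.println("visible rulings: " + visible.size());
    }
}
```

Only the rulings that survive this filter would then be passed to the four-lines-form-a-rectangle check, so white borders drawn at the page boundary would no longer be reported as tables.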