Open cinjon opened 8 years ago
In the same document, if I run tabula-java.jar -f JSON -g -r -p all ./2217.2015.09.04.平成28年-3-10.pdf, then the output for just the last page is four tables with the following characteristics:
(top, height, left, width, # of rows)
(0.118, 841.6806640625, 0.0, 594.719970703125, 7)
(0.118, 841.6806640625, 0.0, 594.719970703125, 11)
(0.118, 841.6806640625, 0.0, 594.719970703125, 6)
(0.118, 841.6806640625, 0.0, 594.719970703125, 11)
How is this the case if it should be yielding the same output as the Tabula GUI, which shows four tables, each with different top/height/left/width attributes?
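One way to see what is unusual about these numbers is to check that all four areas are identical and nearly as large as the page. The sketch below does that with the tuples quoted above. The A4 page size (595.28 x 841.89 pt) and the 95% threshold are assumptions for illustration, not values reported by tabula-java.

```python
# (top, height, left, width, rows) tuples copied from the output quoted above.
reported = [
    (0.118, 841.6806640625, 0.0, 594.719970703125, 7),
    (0.118, 841.6806640625, 0.0, 594.719970703125, 11),
    (0.118, 841.6806640625, 0.0, 594.719970703125, 6),
    (0.118, 841.6806640625, 0.0, 594.719970703125, 11),
]

# Assumption: an A4 page measured in PDF points. The post does not state the page size.
PAGE_WIDTH, PAGE_HEIGHT = 595.28, 841.89


def covers_page(height, width, min_fraction=0.95):
    """Return True if the area spans at least min_fraction of the page both ways."""
    return (width >= min_fraction * PAGE_WIDTH
            and height >= min_fraction * PAGE_HEIGHT)


# Distinct bounding boxes, ignoring the row count.
distinct_areas = {(top, height, left, width)
                  for top, height, left, width, _ in reported}
page_sized = [covers_page(height, width)
              for _, height, _, width, _ in reported]

print(len(distinct_areas), "distinct area(s); page-sized flags:", page_sized)
```

If every area is flagged as page-sized, the reported boxes most likely describe a rectangle around the page rather than the individual tables.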
I just ran a quick test, and confirmed my suspicion. This is the output of java -cp target/tabula-extractor-0.7.4-SNAPSHOT-jar-with-dependencies.jar technology.tabula.debug.Debug -r -p 1 2217.2015.09.04.平成28年-3-10.pdf
It shows that there are “ruling” objects, presumably white, that Tabula misinterprets as table boundaries. Furthermore, there are two rectangles on that page, one contained inside the other (see the ruling lines right at the page boundary).
Tabula’s (and tabula-java’s) table detection algorithm is quite rudimentary: if it finds a set of 4 lines that form a rectangle, it’ll detect that as a table boundary. A possible solution to this (when the lines are 'invisible' or the same color as the page background) is described in issue #21.
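To make the idea concrete, here is a minimal sketch of "four lines that form a rectangle" detection, plus a filter that drops rulings drawn in the background color before detection runs. This is not tabula-java's code and not necessarily the approach from issue #21: the `Ruling` class, its `color` field, and both function names are hypothetical, and the coordinates are made up to mimic a white page border around one real table.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Ruling:
    """Hypothetical axis-aligned line segment with an RGB stroke color."""
    x1: float
    y1: float
    x2: float
    y2: float
    color: tuple = (0, 0, 0)


def find_rectangles(rulings):
    """Return (left, top, right, bottom) for each set of 4 rulings closing a box."""
    horizontals = [r for r in rulings if r.y1 == r.y2]
    verticals = [r for r in rulings if r.x1 == r.x2]
    found = []
    for h_pair in combinations(horizontals, 2):
        top, bottom = sorted(r.y1 for r in h_pair)
        for v_pair in combinations(verticals, 2):
            left, right = sorted(r.x1 for r in v_pair)
            # Each horizontal must reach both verticals, and vice versa.
            spans_width = all(min(h.x1, h.x2) <= left and max(h.x1, h.x2) >= right
                              for h in h_pair)
            spans_height = all(min(v.y1, v.y2) <= top and max(v.y1, v.y2) >= bottom
                               for v in v_pair)
            if spans_width and spans_height and left < right and top < bottom:
                found.append((left, top, right, bottom))
    return found


def visible_rulings(rulings, background=(255, 255, 255)):
    """Drop rulings stroked in the (assumed) page background color."""
    return [r for r in rulings if r.color != background]


WHITE = (255, 255, 255)
rulings = [
    # Invisible white border at the page boundary (made-up coordinates).
    Ruling(0, 0, 595, 0, WHITE), Ruling(0, 842, 595, 842, WHITE),
    Ruling(0, 0, 0, 842, WHITE), Ruling(595, 0, 595, 842, WHITE),
    # A real table drawn with black rulings.
    Ruling(50, 100, 300, 100), Ruling(50, 200, 300, 200),
    Ruling(50, 100, 50, 200), Ruling(300, 100, 300, 200),
]

naive = find_rectangles(rulings)
filtered = find_rectangles(visible_rulings(rulings))
print("without filter:", naive)
print("with filter:", filtered)
```

Without the filter the page-sized rectangle is reported alongside the real table; with it, only the table drawn in a visible color remains. A real fix would also need to read each ruling's stroke color from the PDF, which this sketch simply assumes is available.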
https://www.dropbox.com/s/4nvptahkr1d1w16/2217.2015.09.04.%E5%B9%B3%E6%88%9028%E5%B9%B4-3-10.pdf?dl=0
In the pdf at the link above:
Notably, when I click the preview, pages 1-3 display the text as just large blobs.