
tabulapdf / tabula-java

Extract tables from PDF files

MIT License


Auto-detect problems. #40

Open cinjon opened 8 years ago

cinjon commented 8 years ago

https://www.dropbox.com/s/4nvptahkr1d1w16/2217.2015.09.04.%E5%B9%B3%E6%88%9028%E5%B9%B4-3-10.pdf?dl=0

In the pdf at the link above:

Both the Tabula GUI and tabula-java detect every page as a table.
- Pages 2 and 3 should have no tables.
- Page 1 (TOC) arguably is a table but I wouldn't expect it to find this format consistently.
- Pages 4-7 are tabular, but the real table is hard to pick out because so many regions are returned. On page 4, for example, the results include the whole page, a subsection running from "3.四半期財務諸表" ("3. Quarterly Financial Statements") to just above the page number, and every blue row.
- Page 8 has two legitimate tables [whose bottom and right edges I need to expand to see them completely], but auto-detect also finds two parent tables that are supersets of the legitimate ones.
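The "parent table" case on page 8 can be illustrated with a small post-filter over detected areas. This is only a sketch, not part of tabula-java: the class name `SupersetFilter`, the method `dropParents`, and the coordinates are hypothetical, and plain `java.awt.geom.Rectangle2D` stands in for whatever area type the detector returns.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a tabula-java class.
class SupersetFilter {

    // Returns the candidates that do not enclose any other, different candidate.
    static List<Rectangle2D> dropParents(List<Rectangle2D> candidates) {
        List<Rectangle2D> kept = new ArrayList<>();
        for (Rectangle2D outer : candidates) {
            boolean enclosesAnother = false;
            for (Rectangle2D inner : candidates) {
                // Identical areas are skipped so duplicates do not eliminate each other.
                if (outer != inner && !outer.equals(inner) && outer.contains(inner)) {
                    enclosesAnother = true;
                    break;
                }
            }
            if (!enclosesAnother) {
                kept.add(outer);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up coordinates: one page-sized parent and two inner tables.
        List<Rectangle2D> candidates = new ArrayList<>();
        candidates.add(new Rectangle2D.Double(0, 0, 595, 842));
        candidates.add(new Rectangle2D.Double(50, 100, 400, 200));
        candidates.add(new Rectangle2D.Double(50, 400, 400, 150));
        System.out.println(dropParents(candidates).size());
    }
}
```

A filter like this would not help on page 4, where the spurious areas (the blue rows) sit inside the real table: there it would discard the real table and keep the rows.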

Notably, when I click the preview, pages 1-3 display the text as just large blobs.

cinjon commented 8 years ago

In the same document, if I run tabula-java.jar -f JSON -g -r -p all ./2217.2015.09.04.平成28年-3-10.pdf, then the output for just the last page is four tables with the following characteristics:

(top, height, left, width, # of rows)
(0.118, 841.6806640625, 0.0, 594.719970703125, 7)
(0.118, 841.6806640625, 0.0, 594.719970703125, 11)
(0.118, 841.6806640625, 0.0, 594.719970703125, 6)
(0.118, 841.6806640625, 0.0, 594.719970703125, 11)

Why is that, given that it should yield the same output as the Tabula GUI, which also finds four tables but each with different top/height/left/width attributes?

jazzido commented 8 years ago

I just ran a quick test, and confirmed my suspicion. This is the output of java -cp target/tabula-extractor-0.7.4-SNAPSHOT-jar-with-dependencies.jar technology.tabula.debug.Debug -r -p 1 2217.2015.09.04.平成28年-3-10.pdf

[Image: debug output for page 1 of the PDF]

It shows that there are “ruling” objects, presumably white, that Tabula misinterprets as table boundaries. Furthermore, there are two rectangles on that page, one contained inside the other (see the ruling lines right at the page boundary).

Tabula’s (and tabula-java’s) table detection algorithm is quite rudimentary: if it finds a set of 4 lines that form a rectangle, it’ll detect that as a table boundary. A possible solution to this (when the lines are 'invisible' or the same color as the page background) is described in issue #21.
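To make the "4 lines form a rectangle" rule concrete, here is a minimal sketch of that kind of check, with one way of ignoring invisible lines added in front of it. It is not tabula-java's actual code and not necessarily the approach from issue #21: the `Ruling` type, the packed-RGB color field, the 1-point tolerance, and the assumption that the page background color is known are all made up for illustration.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not tabula-java's implementation.
class RectangleFromRulings {

    // Assumed ruling type: an axis-aligned segment plus its stroke color (packed RGB).
    static final class Ruling {
        final double x1, y1, x2, y2;
        final int rgb;

        Ruling(double x1, double y1, double x2, double y2, int rgb) {
            // Normalize so that (x1, y1) is the top-left end of the segment.
            this.x1 = Math.min(x1, x2);
            this.y1 = Math.min(y1, y2);
            this.x2 = Math.max(x1, x2);
            this.y2 = Math.max(y1, y2);
            this.rgb = rgb;
        }

        boolean horizontal() { return y1 == y2 && x1 != x2; }
        boolean vertical() { return x1 == x2 && y1 != y2; }
    }

    static final double TOL = 1.0; // assumed tolerance, in points

    static boolean near(double a, double b) {
        return Math.abs(a - b) <= TOL;
    }

    // Reports a rectangle for every pair of horizontal rulings whose ends are
    // joined by a left and a right vertical ruling. Rulings drawn in the
    // background color are skipped first, since they are invisible on the page.
    static List<Rectangle2D> detect(List<Ruling> rulings, int backgroundRgb) {
        List<Ruling> h = new ArrayList<>();
        List<Ruling> v = new ArrayList<>();
        for (Ruling r : rulings) {
            if (r.rgb == backgroundRgb) continue;
            if (r.horizontal()) h.add(r);
            else if (r.vertical()) v.add(r);
        }

        List<Rectangle2D> found = new ArrayList<>();
        for (int i = 0; i < h.size(); i++) {
            for (int j = i + 1; j < h.size(); j++) {
                Ruling top = h.get(i).y1 <= h.get(j).y1 ? h.get(i) : h.get(j);
                Ruling bottom = (top == h.get(i)) ? h.get(j) : h.get(i);
                // Both horizontals must share their left and right ends and differ in height.
                if (!near(top.x1, bottom.x1) || !near(top.x2, bottom.x2)) continue;
                if (near(top.y1, bottom.y1)) continue;

                boolean left = false;
                boolean right = false;
                for (Ruling s : v) {
                    boolean spans = near(s.y1, top.y1) && near(s.y2, bottom.y1);
                    if (spans && near(s.x1, top.x1)) left = true;
                    if (spans && near(s.x1, top.x2)) right = true;
                }
                if (left && right) {
                    found.add(new Rectangle2D.Double(
                            top.x1, top.y1, top.x2 - top.x1, bottom.y1 - top.y1));
                }
            }
        }
        return found;
    }
}
```

With four white rulings along the page boundary and four black rulings around a real table, `detect(rulings, 0xFFFFFF)` would report only the black rectangle, whereas without the color check the page-sized rectangle would be reported as well.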