melisabok / tabula-java

Extract tables from PDF files
MIT License

TestTableDetection.[35]: Expected one table and detected two #14

Open melisabok opened 7 years ago

melisabok commented 7 years ago

File us-009.pdf

The new code is detecting more rulings than the old code.

New rulings: new_with_text

Old rulings: old_with_text

That's why it is detecting 2 tables instead of 1; see the images:

New: us-009-1

Old: us-009-1

I think it is OK to detect 2 tables. What should we do in this case?

jazzido commented 7 years ago

That's an interesting side effect of the improvements in PDFBox 2.0: the old version missed some lines.

Also, we've run into this case before. Sometimes, the table detection algorithm picks up two "tables", one contained inside the other. Unfortunately, we haven't arrived at a decision on what to do. My guess is that we should build a tree of rectangles (using containedIn as the linkage criterion) and keep the outermost element. @jeremybmerrill any ideas?
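As a rough illustration of the "keep the outermost element" idea, here is a minimal sketch. It is not the tabula-java implementation: instead of building an actual tree, it uses a flat filter that drops every rectangle contained in another one, which yields the same outermost set for simple nesting. The class name OutermostAreas and the method keepOutermost are hypothetical, and the sketch works on plain java.awt.geom.Rectangle2D rather than tabula's own Rectangle class.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of tabula-java.
public class OutermostAreas {

    // Returns the rectangles that are not contained in any other rectangle.
    // Exact duplicates are collapsed to their first occurrence.
    public static List<Rectangle2D> keepOutermost(List<? extends Rectangle2D> areas) {
        List<Rectangle2D> result = new ArrayList<>();
        for (int i = 0; i < areas.size(); i++) {
            Rectangle2D candidate = areas.get(i);
            boolean nested = false;
            for (int j = 0; j < areas.size() && !nested; j++) {
                if (i == j) {
                    continue;
                }
                Rectangle2D other = areas.get(j);
                // "other" encloses "candidate" and is not the same rectangle
                boolean strictlyInside = other.contains(candidate) && !candidate.contains(other);
                // identical rectangles: keep only the earliest one
                boolean duplicateSeenEarlier = other.equals(candidate) && j < i;
                nested = strictlyInside || duplicateSeenEarlier;
            }
            if (!nested) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

The result does not depend on the order of the input list, at the cost of a quadratic number of containment checks, which should be acceptable for the handful of table candidates found on a page.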

melisabok commented 7 years ago

I found the comparator in the NurminenDetectionAlgorithm and I made a fix to make the tests pass.

I'm not sure this is the right solution, because the comparator doesn't ensure that the TreeSet keeps the outermost table; which table is kept depends on the order of the tables passed to addAll:

tableSet.addAll(tableAreas);

With this fix all the TestTableDetection tests are passing.
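To make the ordering concern concrete, here is a small self-contained sketch. The comparator below is a hypothetical stand-in, not the one in NurminenDetectionAlgorithm: it only assumes that nested rectangles compare as equal (return 0). The class name TreeSetOrderDemo and the method survivor are likewise made up for this illustration.

```java
import java.awt.geom.Rectangle2D;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical demo, not part of tabula-java.
public class TreeSetOrderDemo {

    // Assumed comparator: two rectangles are "equal" when one contains the other;
    // otherwise they are ordered by top edge, then by left edge.
    public static final Comparator<Rectangle2D> NESTED_AS_EQUAL = (a, b) -> {
        if (a.contains(b) || b.contains(a)) {
            return 0;
        }
        int byY = Double.compare(a.getY(), b.getY());
        return byY != 0 ? byY : Double.compare(a.getX(), b.getX());
    };

    // Adds all areas to a TreeSet and returns the first element it kept.
    public static Rectangle2D survivor(List<Rectangle2D> tableAreas) {
        TreeSet<Rectangle2D> tableSet = new TreeSet<>(NESTED_AS_EQUAL);
        // A TreeSet ignores an element that compares as 0 to one already present,
        // so of two nested rectangles only the one added first is retained.
        tableSet.addAll(tableAreas);
        return tableSet.first();
    }
}
```

With an outer rectangle and an inner one nested in it, survivor returns the inner rectangle when the list is [inner, outer] and the outer rectangle when the list is [outer, inner], which is the order dependence described above.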