Open melisabok opened 7 years ago
That's an interesting side effect of the improvements in PDFBox 2.0: the old version missed some lines.
Also, we've run into this case before. Sometimes the table detection algorithm picks up two "tables", one contained inside the other. Unfortunately, we haven't arrived at a decision on what to do. My guess is that we should build a tree of rectangles (using containedIn as the linkage criterion) and keep the outermost element. @jeremybmerrill any ideas?
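A rough sketch of the "keep the outermost element" idea, in case it helps the discussion. It is not tabula code: the class and method names are made up, it uses `java.awt.geom.Rectangle2D.contains` in place of containedIn, and it does a flat pairwise check instead of building an explicit tree (the rectangles that survive are the ones that would be the roots of that tree).

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of tabula.
class OutermostAreas {

    // Returns the rectangles that are not contained in any other rectangle
    // of the list. Exact duplicates are collapsed to their first occurrence.
    static List<Rectangle2D> keepOutermost(List<Rectangle2D> areas) {
        List<Rectangle2D> result = new ArrayList<>();
        for (int i = 0; i < areas.size(); i++) {
            Rectangle2D candidate = areas.get(i);
            boolean nested = false;
            for (int j = 0; j < areas.size(); j++) {
                if (i == j) {
                    continue;
                }
                Rectangle2D other = areas.get(j);
                // A rectangle "contains" an identical one, so for duplicates
                // only the later copy is treated as nested.
                boolean firstOfDuplicates = candidate.equals(other) && i < j;
                if (other.contains(candidate) && !firstOfDuplicates) {
                    nested = true;
                    break;
                }
            }
            if (!nested) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

This is quadratic in the number of detected areas, which is probably fine for the handful of tables found on a page. The result does not depend on the order of the input list.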
I found the comparator in the NurminenDetectionAlgorithm and I made a fix to make the tests pass.
I'm not sure if this is the right solution, because this comparator doesn't ensure that the TreeSet keeps the outermost table. Which table is kept depends on the order of the tables that you pass to addAll:
tableSet.addAll(tableAreas);
With this fix all the TestTableDetection tests are passing.
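To illustrate the ordering concern, here is a small standalone sketch. The comparator is a hypothetical stand-in (it is not the one in NurminenDetectionAlgorithm): it treats two areas as equal when either contains the other, and otherwise orders them by position. A `TreeSet` ignores an element that compares as equal to one it already holds, so the first area inserted is the one that stays.

```java
import java.awt.geom.Rectangle2D;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical demo class, not part of tabula.
class TreeSetOrderDemo {

    // Assumed comparator: areas are "equal" if one contains the other;
    // otherwise they are ordered top to bottom, then left to right.
    static final Comparator<Rectangle2D> CONTAINMENT = (a, b) -> {
        if (a.contains(b) || b.contains(a)) {
            return 0;
        }
        int byY = Double.compare(a.getY(), b.getY());
        return byY != 0 ? byY : Double.compare(a.getX(), b.getX());
    };

    // Adds the areas in list order and returns the first element of the set.
    static Rectangle2D firstKept(List<Rectangle2D> areas) {
        TreeSet<Rectangle2D> tableSet = new TreeSet<>(CONTAINMENT);
        tableSet.addAll(areas);
        return tableSet.first();
    }
}
```

With an outer area and an area nested inside it, `firstKept` returns the outer one when it comes first in the list and the inner one when the order is reversed, so the outcome is decided by insertion order and not by the comparator.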
File us-009.pdf
The new code is detecting more rulings than the old code.
New rulings:
Old rulings:
That's why it is detecting 2 tables instead of 1, see images:
New:
Old:
I think it is OK to detect 2 tables. What should we do in this case?