gsireesh / ht-max

Code for the HT-MAX project
Apache License 2.0

Fix paragraph merging #37

Closed gsireesh closed 5 months ago

gsireesh commented 5 months ago

This PR does a few things to fix paragraph merging: it updates the paragraph merging algorithm to consider all possible overlaps for a given span, widens the tolerance for in-column grouping of sentence boxes, and adds a vertical monotonicity constraint when grouping paragraphs, which accounts for Grobid's tendency to include figure captions in paragraph bounding boxes.
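To illustrate the in-column tolerance and the vertical monotonicity constraint, here is a minimal sketch. It is not the code in this PR: the `Box` class, the `group_boxes` function, and the `x_tolerance` default are all hypothetical, and it assumes boxes arrive in reading order with `y` increasing down the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    # Hypothetical sentence bounding box; y grows downward on the page.
    left: float
    top: float
    right: float
    bottom: float


def group_boxes(boxes, x_tolerance=5.0):
    """Group sentence boxes (given in reading order) into paragraphs.

    A box joins the current group only if its left edge is within
    x_tolerance of the group's first box (same column) and it does not
    sit above the previous box (vertical monotonicity).
    """
    groups = []
    for box in boxes:
        if groups:
            current = groups[-1]
            same_column = abs(box.left - current[0].left) <= x_tolerance
            moves_down = box.top >= current[-1].top
            if same_column and moves_down:
                current.append(box)
                continue
        # Either the first box, a column change, or a jump back up the page.
        groups.append([box])
    return groups


example = [
    Box(50, 100, 300, 112),   # first line of a paragraph
    Box(52, 114, 300, 126),   # next line, slightly offset but within tolerance
    Box(50, 60, 300, 72),     # jumps back up (e.g. a caption): new group
    Box(320, 100, 560, 112),  # second column: new group
]
groups = group_boxes(example)
```

In this sketch the third box starts a new group because it breaks monotonicity, which is the kind of split a caption folded into a paragraph box would need.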

Note that this PR does NOT do the right thing for the Grobid parser, which would be spatial clustering of boxes into groups. That would be a lot more resilient, and would also account for cases where one box potentially overlaps with many others. The lack of clustering is still causing some degree of parsing error, but I do not have time to fix it now.
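For reference, one possible reading of "spatial clustering" is to treat overlapping boxes as connected and take connected components. The sketch below is an assumption about what that could look like, not something implemented in this PR; `overlaps`, `cluster_boxes`, and the `margin` parameter are hypothetical names, and boxes are plain `(left, top, right, bottom)` tuples.

```python
def overlaps(a, b, margin=0.0):
    """True if two (left, top, right, bottom) boxes intersect,
    optionally padded by margin on every side."""
    return (
        a[0] <= b[2] + margin and b[0] <= a[2] + margin
        and a[1] <= b[3] + margin and b[1] <= a[3] + margin
    )


def cluster_boxes(boxes, margin=0.0):
    """Return clusters of box indices, where boxes that overlap
    (directly or through a chain of overlaps) share a cluster."""
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise comparison is O(n^2); fine for a page's worth of boxes.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlaps(boxes[i], boxes[j], margin):
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(boxes)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


page_boxes = [
    (0, 0, 10, 10),
    (5, 5, 15, 15),          # overlaps the first box
    (14, 14, 20, 20),        # overlaps the second box only
    (100, 100, 110, 110),    # isolated
]
clusters = cluster_boxes(page_boxes)
```

Because membership is transitive, a box that overlaps many others simply pulls them all into one cluster instead of being assigned to whichever overlap is found first.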