pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Improve distance function for textboxes #322

Open pietermarsman opened 5 years ago

pietermarsman commented 5 years ago

The current distance function computes the area between two textboxes. As a result, it can rank textboxes A and B as the closest pair even when a third textbox C lies between them. The code works around this by checking whether any other textbox lies between the two to-be-grouped textboxes (the function isany).
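To make the problem concrete, here is a minimal sketch. It assumes that "the area between two textboxes" means the area of the joint bounding box minus the areas of the two boxes. The `Box` type and the names `bounding_box`, `area_dist` and `overlaps` are made up for this illustration and are not pdfminer.six's API.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical axis-aligned textbox: (x0, y0) bottom-left, (x1, y1) top-right."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def bounding_box(a: Box, b: Box) -> Box:
    """Smallest box that encloses both a and b."""
    return Box(min(a.x0, b.x0), min(a.y0, b.y0), max(a.x1, b.x1), max(a.y1, b.y1))


def area_dist(a: Box, b: Box) -> float:
    """Assumed distance: joint bounding-box area minus the area of each box."""
    return bounding_box(a, b).area - a.area - b.area


def overlaps(a: Box, b: Box) -> bool:
    """True if the interiors of a and b intersect."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


# Two small boxes side by side, with a tall narrow column between them.
A = Box(0, 0, 2, 2)
B = Box(4, 0, 6, 2)
C = Box(2.5, -20, 3.5, 20)

print(area_dist(A, B))  # 4
print(area_dist(A, C))  # 96.0
print(area_dist(B, C))  # 96.0
print(overlaps(C, bounding_box(A, B)))  # True: C sits between A and B
```

Under this assumed distance, A and B form the closest pair by a wide margin even though C separates them, which is the case the extra "is anything in between" check has to catch.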

I think the distance function (dist) can be improved so that it does not have to check for intermediate textboxes.

Ideas:

  1. First order by distance in vertical direction, and then horizontal direction. This would be especially intuitive for converting to plain text, as it follows the reading direction more naturally.
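One possible reading of this idea, as a rough sketch: rank candidate pairs by a tuple of (vertical gap, horizontal gap), so that pairs are compared on vertical separation first and horizontal separation only breaks ties. The `Box` type and the names `axis_gap` and `reading_order_dist` are hypothetical, and treating overlapping extents as a gap of zero is an assumption, not something the issue specifies.

```python
from itertools import combinations
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Hypothetical named textbox: (x0, y0) bottom-left, (x1, y1) top-right."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float


def axis_gap(lo1: float, hi1: float, lo2: float, hi2: float) -> float:
    """Gap between two intervals on one axis; 0 if they touch or overlap."""
    return max(0.0, max(lo1, lo2) - min(hi1, hi2))


def reading_order_dist(a: Box, b: Box) -> Tuple[float, float]:
    """Sort key: vertical gap first, horizontal gap as the tie-breaker."""
    return (
        axis_gap(a.y0, a.y1, b.y0, b.y1),
        axis_gap(a.x0, a.x1, b.x0, b.x1),
    )


# Two small boxes side by side, with a tall narrow column between them.
boxes = [
    Box("A", 0, 0, 2, 2),
    Box("B", 4, 0, 6, 2),
    Box("C", 2.5, -20, 3.5, 20),
]

pairs = sorted(combinations(boxes, 2), key=lambda pair: reading_order_dist(*pair))
ranked = [(a.name, b.name) for a, b in pairs]
print(ranked)  # [('A', 'C'), ('B', 'C'), ('A', 'B')]
```

In this layout every pair overlaps vertically, so the horizontal gap decides the order, and A and B are ranked last because C sits between them. Whether such a key removes the need for an intermediate-textbox check in general is not shown here. Comparing raw float gaps exactly may also be too strict for real layouts, where some tolerance on the vertical gap would probably be needed.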

ReubenJCarter commented 2 weeks ago

Does isany even do anything right now? It just seems to pop from dists, then push right back on, skipping next time...