knaw-huc / loghi

How does reading order detection work with loghi #26

Open icarl-ad opened 1 week ago

icarl-ad commented 1 week ago

Hi,

I was wondering how exactly reading order detection currently works in loghi.

As far as I know, the reading order is determined by rules. Are there multiple rules, depending on the number or position of text regions or on other conditions, or is there just one rule? What are these rules?

And how are these rules applied? Are the coordinates of the upper-left corner of the text region crucial?

Thank you in advance!

rvankoert commented 3 days ago

Hi,

Reading order is determined in two stages. First, text lines are clustered based on their distance, resulting in "paragraphs". Second, the clusters are put into order starting with the top-left cluster: clusters are added by looking downward until the only possibility is to look to another column, and then those are added. In general it tries to form columns, but for more complex pages it will sometimes do strange things. There are some tweaks that try to keep the reading order top-to-bottom and left-to-right, but if the clusters are not laid out in a grid-like fashion, strange things will happen. We are actively looking to improve this and welcome samples of pages with complex reading order.
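
To make the two stages a bit more concrete, here is a rough Python sketch of the general idea. This is not Loghi's actual code: the names `Box`, `cluster_lines`, and `order_clusters`, the `max_gap` threshold, and the simple "horizontal overlap means same column" test are all assumptions made for illustration, and none of the extra tweaks mentioned above are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical stand-in for a text line or cluster)."""
    x0: float
    y0: float
    x1: float
    y1: float


def gap(a: Box, b: Box) -> float:
    """Largest of the horizontal and vertical gaps between two boxes (0 if they touch)."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return max(dx, dy)


def bounding_box(boxes: list[Box]) -> Box:
    """Smallest box that contains all the given boxes."""
    return Box(
        min(b.x0 for b in boxes),
        min(b.y0 for b in boxes),
        max(b.x1 for b in boxes),
        max(b.y1 for b in boxes),
    )


def cluster_lines(lines: list[Box], max_gap: float) -> list[Box]:
    """Stage 1: group lines that lie within max_gap of each other into 'paragraphs'."""
    unvisited = list(lines)
    clusters = []
    while unvisited:
        group = [unvisited.pop(0)]
        i = 0
        # Grow the group by repeatedly pulling in lines close to any member.
        while i < len(group):
            near = [b for b in unvisited if gap(group[i], b) <= max_gap]
            for b in near:
                unvisited.remove(b)
            group.extend(near)
            i += 1
        clusters.append(bounding_box(group))
    return clusters


def order_clusters(clusters: list[Box]) -> list[Box]:
    """Stage 2: read down one column at a time, moving left to right."""
    remaining = list(clusters)
    ordered = []
    while remaining:
        # Seed the next column with the leftmost (then topmost) remaining cluster.
        seed = min(remaining, key=lambda c: (c.x0, c.y0))
        # Everything that overlaps the seed horizontally counts as the same column.
        column = [c for c in remaining if c.x0 <= seed.x1 and seed.x0 <= c.x1]
        column.sort(key=lambda c: c.y0)
        ordered.extend(column)
        remaining = [c for c in remaining if c not in column]
    return ordered
```

On a clean two-column page, `order_clusters(cluster_lines(lines, max_gap))` gives the left column top to bottom, then the right column. On pages where clusters are not arranged in a grid, for example a wide heading spanning both columns, a simple overlap test like this one merges columns in unintended ways, which is the kind of "strange thing" described above.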