qurator-spk / eynollah

Document Layout Analysis
Apache License 2.0

Documentation: schematic and algorithms or heuristics used in-between. #119

Open prhbrt opened 9 months ago

prhbrt commented 9 months ago

I'm trying to get a better understanding of your work and to create a workflow that allows batching pages without reloading the models (which currently takes a lot of time). However, your code is sometimes hard to follow. Could you provide a (crude) schematic, as a graph, of the different models you're using, along with a quick summary of the algorithmic (non-neural-network) parts?

Currently I'm mostly confused by how the reading order is decided. What is the algorithm there?

cneud commented 9 months ago

Hi @prhbrt, thank you for your questions. A rough diagram showing the flow of the data through the various models can be found here.

And here is an excerpt from our paper describing the heuristics used for reading order detection:

We sort columns from left to right and any text regions they contain from top to bottom. We then divide the whole page into boxes based on separators and headings. What we need at the early stage are the coordinates of separators and headings, and the X-coordinates of the columns. The algorithm can be explained as follows: First, separators (or headings) that cover the whole width or all columns of the page specify the main boxes, which are read from top to bottom. Then the X-coordinates of the columns in each main box are detected by summing the text regions along the Y-axis. The minima of this summation give the X-coordinates of the columns. If the main box includes separators covering multiple columns, those are divided into upper and lower boxes, and finally the new boxes inside the main box are ordered from left to right. Reading order inside boxes with multiple columns is again from left to right. Finally, to get the reading order for text contours, the contours inside each box are ordered from top to bottom.
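
To make the excerpt more concrete, here is a minimal sketch of the heuristic in plain Python. It is not Eynollah's actual code. It assumes text regions are axis-aligned boxes given as (x, y, width, height) tuples and that full-width separators are given as a list of Y-coordinates. The function names column_separators and reading_order are made up for this illustration. It also leaves out the step where separators spanning several columns split a main box into upper and lower boxes.

```python
def column_separators(regions, width):
    """Return X-coordinates of the interior gaps between columns.

    Sums region heights along the Y-axis (a vertical projection profile)
    and takes the midpoint of every run of minimal values that does not
    touch the left or right page margin.
    """
    profile = [0] * width
    for x, y, w, h in regions:
        for col in range(max(x, 0), min(x + w, width)):
            profile[col] += h
    lowest = min(profile)
    separators = []
    col = 0
    while col < width:
        if profile[col] != lowest:
            col += 1
            continue
        start = col
        while col < width and profile[col] == lowest:
            col += 1
        # Runs touching a page margin are margins, not column gaps.
        if start > 0 and col < width:
            separators.append((start + col - 1) // 2)
    return separators


def reading_order(regions, width, height, full_width_separators_y):
    """Return region indices in reading order.

    Main boxes (delimited by full-width separators) are read top to
    bottom, columns inside each box left to right, and regions inside
    each column top to bottom.
    """
    bounds = [0] + sorted(full_width_separators_y) + [height]
    order = []
    for top, bottom in zip(bounds[:-1], bounds[1:]):
        # A region belongs to the box that contains its vertical center.
        inside = [
            i for i, (x, y, w, h) in enumerate(regions)
            if top <= y + h / 2 < bottom
        ]
        if not inside:
            continue
        gaps = column_separators([regions[i] for i in inside], width)

        def sort_key(i):
            x, y, w, h = regions[i]
            # Column index = number of gaps left of the region's center.
            column = sum(x + w / 2 > gap for gap in gaps)
            return (column, y)

        order.extend(sorted(inside, key=sort_key))
    return order


# A heading above a two-column layout, with regions given in arbitrary order.
regions = [
    (55, 20, 40, 30),  # right column, upper
    (0, 0, 100, 10),   # heading spanning the page
    (5, 50, 40, 20),   # left column, lower
    (55, 60, 40, 20),  # right column, lower
    (5, 20, 40, 20),   # left column, upper
]
print(reading_order(regions, width=100, height=100, full_width_separators_y=[12]))
# prints [1, 4, 2, 0, 3]
```

The real implementation works on predicted contours and separator masks rather than boxes, so treat this only as an illustration of the ordering logic described above.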

Note that @vahidrezanezhad is currently working on a version that infers the reading order using a machine learning model, see the most recent commits here.

cneud commented 9 months ago

that allows batching pages without reloading the models

btw, since version 0.3.0, Eynollah also has a batch mode (using the -di <directory> flag) that processes all images in a directory without reloading the models for each one. Perhaps that is useful for you?