Open bertsky opened 3 years ago
How is it possible for the page segmentation to create overlapping regions? Since the model is basically a pixel classifier, it should force some flat partitioning. (Or is this really just about convex hull vs finding an alpha shape?)
The pixel classification is turned into polygonal regions, which then go through dilation and erosion operations, and these can lead to overlaps. For a high-level overview, see http://ceur-ws.org/Vol-2723/long20.pdf
The set of polygonal operations that finally lead to overlaps is defined here: https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/custom/layouts/bbz.py#L59
Implementation of the different operations starts here: https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/batch/detect/layout.py#L310
Why did the table detector not pick up the full regular structure on the right? What would the cells and lines need to look like? (Or is this related to the perspective distortion? Or does it expect or get triggered by fg column separators?)
Are we talking about the second example page? It looks to me like the pixel classifier already gets this wrong, i.e. classifies this as text. This would mean that our BBZ table training data did not generalize well for this case.
How is it even possible for Origami to create a PAGE-XML for the original image, if you even included page-level dewarping in between (i.e. how do you keep track of the coordinate system)?
The PAGE-XML is indeed output for the warped page; coordinates are transformed from dewarped into warped space for the export.
The dewarping transformation is basically a grid of dewarped points that models dewarping through linear interpolation and works both ways, i.e. warped -> dewarped and dewarped -> warped. So, for each regular grid point, there is one dewarped grid point, and this mapping of quadrilaterals defines the dewarping. This mapping is available at all post-dewarping stages of the pipeline (it's saved into a separate file).
Implementation is at https://github.com/poke1024/origami/blob/master/origami/core/dewarp.py where the Transformer class implements the actual interpolation, see https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/core/dewarp.py#L143.
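To make the grid idea above concrete, here is a minimal sketch of a piecewise-bilinear mapping from a regular grid to displaced grid points. It is not Origami's Transformer class: the name GridWarp and its constructor arguments are hypothetical, and only the forward direction (regular grid -> displaced points) is shown. The reverse direction would additionally need to locate the quadrilateral that contains a given point.

```python
from bisect import bisect_right


class GridWarp:
    """Hypothetical helper: bilinear interpolation over a grid of displaced points."""

    def __init__(self, xs, ys, targets):
        # xs, ys: strictly increasing grid coordinates in the source space.
        # targets[j][i]: (x, y) position of grid node (xs[i], ys[j]) in the target space.
        self.xs, self.ys, self.targets = xs, ys, targets

    @staticmethod
    def _cell(coords, value):
        # Index of the grid cell containing value, clamped to the valid cell range.
        i = bisect_right(coords, value) - 1
        return max(0, min(i, len(coords) - 2))

    def __call__(self, x, y):
        i = self._cell(self.xs, x)
        j = self._cell(self.ys, y)
        # Fractional position of (x, y) inside its grid cell.
        u = (x - self.xs[i]) / (self.xs[i + 1] - self.xs[i])
        v = (y - self.ys[j]) / (self.ys[j + 1] - self.ys[j])
        p00, p10 = self.targets[j][i], self.targets[j][i + 1]
        p01, p11 = self.targets[j + 1][i], self.targets[j + 1][i + 1]
        # Blend the four corner points of the target quadrilateral.
        return tuple(
            (1 - u) * (1 - v) * a + u * (1 - v) * b + (1 - u) * v * c + u * v * d
            for a, b, c, d in zip(p00, p10, p01, p11)
        )


# One grid cell whose right edge is shifted down by 2 units.
warp = GridWarp([0, 10], [0, 10], [[(0, 0), (10, 2)], [(0, 10), (10, 12)]])
center = warp(5, 5)
```

Because every grid cell maps to exactly one quadrilateral, any polygon coordinate can be pushed through such a mapping point by point, which is how region outlines computed in one space can be exported in the other.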
Why are text lines sometimes split in the middle (vertically)? (Happens a couple of times per page.)
This is probably related to fine tuning of polygonal operations (also see questions below). Either the constituent text line polygons do not get merged in the first place (you might want to look into segment.zip, which contains the raw pixel classifier output as PNG; the image aspect ratio is wrong there), or they get merged and then get split again by an operation called FixSpillOverH (see https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/batch/detect/layout.py#L928 and https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/custom/layouts/bbz.py#L59), which tries to find whitespace columns in regions and splits along them to fix spillover from the pixel classification. Removing that FixSpillOverH (or changing its parameters) from the Transformer in https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/custom/layouts/bbz.py#L59 might fix this. The idea of FixSpillOverH is that sometimes, in the pixel classifier, blocks do get merged which should not, and this tries to fix it - but sometimes it fixes too much.
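The whitespace-column idea can be sketched as follows. This is an illustration of the general technique only, not the FixSpillOverH code: the function names, the binary mask input and the min_width parameter are all assumptions made for the example.

```python
def find_whitespace_columns(mask, min_width=2):
    """Return (start, end) column ranges, end exclusive, that contain no foreground.

    mask is a list of rows, each a list of 0/1 values where 1 marks foreground.
    """
    if not mask:
        return []
    width = len(mask[0])
    empty = [all(row[x] == 0 for row in mask) for x in range(width)]
    gaps, start = [], None
    # The trailing False is a sentinel that closes a run reaching the right edge.
    for x, is_empty in enumerate(empty + [False]):
        if is_empty and start is None:
            start = x
        elif not is_empty and start is not None:
            if x - start >= min_width:
                gaps.append((start, x))
            start = None
    return gaps


def split_at_gaps(mask, min_width=2):
    """Split a region into column ranges separated by interior whitespace gaps."""
    if not mask:
        return []
    width = len(mask[0])
    parts, left = [], 0
    for start, end in find_whitespace_columns(mask, min_width):
        if start == 0 or end == width:
            continue  # a gap touching the border is a margin, not a separator
        parts.append((left, start))
        left = end
    parts.append((left, width))
    return parts


region = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 0, 1],
]
parts = split_at_gaps(region, min_width=2)
```

A sketch like this also shows why such a step can split too much: a wide gap between two words on a short line looks the same as a gap between two columns, so the outcome depends heavily on the width threshold.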
Why are text region outlines smaller than their constituent text line contours?
This is a good question. My best ad hoc guess is the --contours-buffer option in https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/batch/detect/lines.py#L158 which expands text line contours by some amount.
Why are text line contours overlapping each other? Would it be possible to get tight, non-overlapping polygonal contours?
Yes. The code location to experiment is at https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/custom/layouts/bbz.py#L59 where you might want to remove operations or add a full overlap merge operation.
Instead of the current implementation you could use

    Transformer([
        OverlapMerger(0)
    ])

which means merging all overlapping regions (starting at any overlap > 0) and not doing any dilations or erosions. This might be worth experimenting with.
The current default set of operations is fine-tuned towards some border cases encountered in the BBZ layout.
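To illustrate what "merging all overlapping regions" amounts to, here is a small sketch that uses axis-aligned boxes as a stand-in for polygons. It is not Origami's OverlapMerger: the function names and the min_overlap parameter are hypothetical, and merging into a bounding box is a simplification of a real polygon union.

```python
def overlap_area(a, b):
    """Intersection area of two boxes given as (x0, y0, x1, y1)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0


def merge_overlapping(boxes, min_overlap=0):
    """Repeatedly merge any two boxes whose overlap area exceeds min_overlap."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                if overlap_area(a, b) > min_overlap:
                    # Replace the pair by its common bounding box and start over,
                    # since the grown box may now overlap further boxes.
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes


result = merge_overlapping([(0, 0, 4, 4), (3, 3, 6, 6), (10, 10, 12, 12)])
```

With a threshold of 0, boxes that merely touch along an edge stay separate, while any positive overlap triggers a merge, so the output is free of overlaps by construction.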
Is there a way to export the separator regions, too? Similarly, would it be possible to represent tables as recursive TableRegions instead of recursive TextRegions?
Not in the API or exports at this point, but after running the "contours" stage, you can unzip contours.0.zip (in the .out folder that Origami created in your docs folder) which contains separators/H and separators/V folders, which contain all separators as polygonal wkt files (read using shapely.wkt.loads). Coordinates are in warped space at this point.
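Reading the separators from the archive could look roughly like this. The helper name load_separators and its parse parameter are hypothetical, and the sketch assumes that each file sits directly inside a separators/H or separators/V folder and holds one UTF-8 encoded WKT string. By default it parses with shapely.wkt.loads, which requires the third-party shapely package.

```python
import zipfile
from pathlib import PurePosixPath


def load_separators(zip_path, parse=None):
    """Collect horizontal and vertical separators from a contours archive."""
    if parse is None:
        from shapely import wkt  # third-party, imported only when needed
        parse = wkt.loads
    result = {"H": [], "V": []}
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            parts = PurePosixPath(name).parts
            for kind in result:
                # Match entries of the form .../separators/<kind>/<file>
                if len(parts) >= 3 and parts[-3:-1] == ("separators", kind):
                    result[kind].append(parse(zf.read(name).decode("utf-8")))
    return result
```

A call such as load_separators("contours.0.zip") would then return a dictionary with one list of geometries per orientation. As noted above, these coordinates are in warped space.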
In terms of PageXML export, there is some simple support for exporting TableRegions (see https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/batch/detect/compose.py#L145), but this assumes that the earlier stages and the pixel classifier classify a region as table, which might go wrong in this use case.
I would need to look into this in more detail to give a better answer.
Not sure if this is a bug at all. I've used your pretrained BBZ model to segment pages in similar data: Börsenblatt des Deutschen Buchhandels. These also have 2-column layouts besides the 3- and 4-column layouts of Berliner Börsenzeitung, and the advertisement parts look very different. But I assumed the domains are close enough for pages like this.
The bbz-segment results (via full Origami pipeline and compose --page-xml) do look very good in general. This is truly amazing work!
But some errors leave me puzzled:
(Sorry, cannot get these to render with equal width in GFM...)
Here what I don't understand is: