poke1024 / origami

A suite of batches and tools for OCR tasks.

overlapping regions #7

Open bertsky opened 3 years ago

bertsky commented 3 years ago

Not sure if this is a bug at all. I've used your pretrained BBZ model to segment pages from similar data: Börsenblatt des Deutschen Buchhandels. These also have 2-column layouts besides the 3- and 4-column layouts of Berliner Börsenzeitung, and the advertisement parts look very different. But I assumed the domains are close enough for pages like this.

The bbz-segment results (via full Origami pipeline and compose --page-xml) do look very good in general. This is truly amazing work!

But some errors leave me puzzled:

| original | origami |
| --- | --- |
| FILE_0016_ORIGINAL | FILE_0016_ORIGINAL_pageviewer-all |
| FILE_0002_ORIGINAL | FILE_0002_ORIGINAL_pageviewer-all |

(Sorry, cannot get these to render with equal width in GFM...)

Here is what I don't understand:

  1. How is it possible for the page segmentation to create overlapping regions? Since the model is basically a pixel classifier, it should force some flat partitioning. (Or is this really just about convex hull vs finding an alpha shape?)
  2. Why did the table detector not pick up the full regular structure on the right? What would the cells and lines need to look like? (Or is this related to the perspective distortion? Or does it expect or get triggered by fg column separators?)
  3. How is it even possible for Origami to create a PAGE-XML for the original image, if you even included page-level dewarping in between (i.e. how do you keep track of the coordinate system)?
  4. Why are text lines sometimes split in the middle (vertically)? (Happens a couple of times per page.)
  5. Why are text region outlines smaller than their constituent text line contours?
  6. Why are text line contours overlapping each other? Would it be possible to get tight, non-overlapping polygonal contours?
  7. Is there a way to export the separator regions, too? Similarly, would it be possible to represent tables as recursive TableRegions instead of recursive TextRegions?

poke1024 commented 3 years ago

How is it possible for the page segmentation to create overlapping regions? Since the model is basically a pixel classifier, it should force some flat partitioning. (Or is this really just about convex hull vs finding an alpha shape?)

The pixel classification is turned into polygonal regions, which then go through dilation and erosion operations; these can lead to overlaps. For a high-level overview, see http://ceur-ws.org/Vol-2723/long20.pdf

The set of polygonal operations that finally lead to overlaps is defined here: https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/custom/layouts/bbz.py#L59

Implementation of the different operations starts here: https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/batch/detect/layout.py#L310
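To illustrate why dilation alone is enough to break a flat partitioning, here is a toy sketch. It uses axis-aligned boxes as stand-ins for region polygons; the `Box` class and its methods are made up for this example and are not Origami's code.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle standing in for a region polygon (hypothetical)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def dilate(self, d: float) -> "Box":
        # Grow the box by d on every side (a negative d erodes it).
        return Box(self.x0 - d, self.y0 - d, self.x1 + d, self.y1 + d)

    def overlap_area(self, other: "Box") -> float:
        # Area of the intersection, or 0 if the boxes only touch or are apart.
        w = min(self.x1, other.x1) - max(self.x0, other.x0)
        h = min(self.y1, other.y1) - max(self.y0, other.y0)
        return max(w, 0.0) * max(h, 0.0)


# Two regions that share an edge, as a pixel-wise partitioning would give.
left = Box(0, 0, 100, 200)
right = Box(100, 0, 200, 200)

# Dilating each region independently makes them overlap along that edge.
overlap = left.dilate(5).overlap_area(right.dilate(5))
```

Before dilation the two boxes have zero overlap; afterwards they share a strip along their common edge. Whether such overlaps survive depends on which operations follow in the configured chain.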

Why did the table detector not pick up the full regular structure on the right? What would the cells and lines need to look like? (Or is this related to the perspective distortion? Or does it expect or get triggered by fg column separators?)

Are we talking about the second example page? It looks to me like the pixel classifier already gets this wrong, i.e. classifies this as text. This would mean that our BBZ table training data did not generalize well for this case.

How is it even possible for Origami to create a PAGE-XML for the original image, if you even included page-level dewarping in between (i.e. how do you keep track of the coordinate system)?

The PAGE-XML is indeed output for the warped page; coordinates are transformed from dewarped into warped space for the export.

The dewarping transformation is basically a grid of dewarped points that models dewarping through linear interpolation and works both ways, i.e. warped -> dewarped and dewarped -> warped. For each regular grid point there is one dewarped grid point, and this mapping of quadrilaterals defines the dewarping. The mapping is available at all post-dewarping stages of the pipeline (it's saved into a separate file).

The implementation is at https://github.com/poke1024/origami/blob/master/origami/core/dewarp.py, where the Transformer class implements the actual interpolation; see https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/core/dewarp.py#L143.
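The grid idea can be sketched roughly as follows. This is a minimal illustration of mapping a point through a grid of target points with bilinear interpolation, assuming a square cell size; `map_point` and its parameters are hypothetical names and not the API of the Transformer class linked above. It covers only one direction; the reverse direction additionally needs to locate the quadrilateral that contains the query point.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def map_point(grid: List[List[Point]], cell: float, x: float, y: float) -> Point:
    """Map (x, y) from regular-grid space through a grid of target points.

    grid[r][c] is the target position of the regular grid point
    (c * cell, r * cell). Inside a cell, the four corner targets are
    blended bilinearly.
    """
    rows, cols = len(grid), len(grid[0])
    # Index of the cell containing the point, clamped to the grid.
    c = min(max(int(x // cell), 0), cols - 2)
    r = min(max(int(y // cell), 0), rows - 2)
    # Fractional position inside the cell.
    u = x / cell - c
    v = y / cell - r
    (ax, ay), (bx, by) = grid[r][c], grid[r][c + 1]
    (cx, cy), (dx, dy) = grid[r + 1][c], grid[r + 1][c + 1]
    # Interpolate along the top and bottom edges, then between them.
    top = (ax + u * (bx - ax), ay + u * (by - ay))
    bottom = (cx + u * (dx - cx), cy + u * (dy - cy))
    return (top[0] + v * (bottom[0] - top[0]), top[1] + v * (bottom[1] - top[1]))


# One cell of size 10 whose right edge is shifted down by 2.
grid = [[(0.0, 0.0), (10.0, 2.0)],
        [(0.0, 10.0), (10.0, 12.0)]]
center = map_point(grid, 10.0, 5.0, 5.0)
```

Because every polygon vertex can be pushed through such a mapping independently, coordinates computed after dewarping can be carried back into the coordinate system of the original image for export.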

Why are text lines sometimes split in the middle (vertically)? (Happens a couple of times per page.)

This is probably related to the fine-tuning of the polygonal operations (see also the questions below). Either the constituent text line polygons do not get merged in the first place (you might want to look into segment.zip, which contains the raw pixel classifier output as PNG; note that the image ratio is wrong there), or they get merged and are then split again by an operation called FixSpillOverH (see https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/batch/detect/layout.py#L928 and https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/custom/layouts/bbz.py#L59).

FixSpillOverH tries to find whitespace columns in regions and splits along them to fix spillover from the pixel classification. The idea is that the pixel classifier sometimes merges blocks that should stay separate, and this operation tries to fix that - but sometimes it fixes too much. Removing FixSpillOverH from the Transformer in https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/custom/layouts/bbz.py#L59 (or changing its parameters) might fix this.
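To make the "fixes too much" failure mode concrete, here is a small sketch of splitting along whitespace columns. It is not the FixSpillOverH implementation; the function names, the binary mask representation and the `min_width` parameter are assumptions made for illustration.

```python
from typing import List, Tuple


def whitespace_columns(mask: List[List[int]], min_width: int) -> List[Tuple[int, int]]:
    """Return [start, end) column ranges without ink in any row that are
    at least min_width wide. mask[y][x] is 1 for ink, 0 for background."""
    width = len(mask[0])
    empty = [all(row[x] == 0 for row in mask) for x in range(width)]
    gaps, start = [], None
    for x, is_empty in enumerate(empty + [False]):  # sentinel closes a trailing run
        if is_empty and start is None:
            start = x
        elif not is_empty and start is not None:
            if x - start >= min_width:
                gaps.append((start, x))
            start = None
    return gaps


def split_at_gaps(width: int, gaps: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Split the column range [0, width) at gaps that lie inside the region."""
    parts, left = [], 0
    for start, end in gaps:
        if start > left and end < width:  # ignore gaps touching the region border
            parts.append((left, start))
            left = end
    parts.append((left, width))
    return parts


# Two blocks separated by a wide gap (columns 3-5); the right block also
# contains a narrow one-column gap (column 8), e.g. a space between words.
mask = [
    [1, 1, 1, 0, 0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 1, 1, 0, 1],
]
parts_strict = split_at_gaps(10, whitespace_columns(mask, min_width=2))
parts_loose = split_at_gaps(10, whitespace_columns(mask, min_width=1))
```

With the stricter threshold only the wide gap splits the region; with the looser one the narrow gap also triggers a split, which is the kind of over-splitting that would cut a text line in the middle.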

Why are text region outlines smaller than their constituent text line contours?

This is a good question. My best ad hoc guess is the --contours-buffer option in https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/batch/detect/lines.py#L158, which expands text line contours by some amount.

Why are text line contours overlapping each other? Would it be possible to get tight, non-overlapping polygonal contours?

Yes. The place in the code to experiment with is https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/custom/layouts/bbz.py#L59, where you might want to remove operations or add a full overlap merge operation.

Instead of the current implementation you could use a

Transformer([
    OverlapMerger(0)
])

which means merging all overlapping regions (starting at any overlap > 0) and not doing any dilations or erosions. This might be worth experimenting with.

The current default set of operations is fine-tuned towards some edge cases encountered in the BBZ layout.

Is there a way to export the separator regions, too? Similarly, would it be possible to represent tables as recursive TableRegions instead of recursive TextRegions?

Not in the API or exports at this point. However, after running the "contours" stage, you can unzip contours.0.zip (in the .out folder that Origami created in your docs folder). It contains the folders separators/H and separators/V, which hold all separators as polygonal WKT files (read them using shapely.wkt.loads). Coordinates are in warped space at this point.
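Reading the separators could look roughly like the sketch below. `load_separators` is a hypothetical helper, not part of Origami; it assumes only the folder names mentioned above and treats every file under them as WKT, since the exact file names inside the archive are not specified here. The `parse` hook is an addition for this example so that the archive layout can be inspected without shapely installed.

```python
import zipfile
from typing import Callable, Dict, List, Optional


def load_separators(zip_path, parse: Optional[Callable[[str], object]] = None) -> Dict[str, List[object]]:
    """Collect separator geometries from a contours archive, keyed by "H" and "V"."""
    if parse is None:
        # Third-party dependency, imported lazily; yields shapely geometries.
        from shapely import wkt
        parse = wkt.loads
    result: Dict[str, List[object]] = {"H": [], "V": []}
    with zipfile.ZipFile(zip_path) as zf:
        for name in sorted(zf.namelist()):
            if name.endswith("/"):
                continue  # skip directory entries
            for kind in result:
                if name.startswith(f"separators/{kind}/"):
                    result[kind].append(parse(zf.read(name).decode("utf-8")))
    return result
```

Called with the path to contours.0.zip and no `parse` argument, this would return shapely geometries in warped coordinates, which could then be written out as separator regions by a custom export step.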

In terms of PAGE-XML export, there is some simple support for exporting TableRegions (see https://github.com/poke1024/origami/blob/544485bfdacf28baa420d54fef8bf087fa0c4f2b/origami/batch/detect/compose.py#L145), but this assumes that the earlier stages and the pixel classifier classify a region as a table, which might go wrong in this use case.

I would need to look into this in more detail to give a better answer.