monniert / docExtractor

(ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper
https://www.tmonnier.com/docExtractor
MIT License

Trying to train a Text Region detector but failed #10

Closed seekingdeep closed 3 years ago

seekingdeep commented 3 years ago

@monniert Hi there, I have trained a new model to detect text regions/paragraphs. The results were bad even though the accuracy was high in both training and validation. The sample dataset: https://drive.google.com/drive/folders/1bCuI9SYXOuRUeP4MXY0gfcaKu6O3_WlM?usp=sharing

Example: 1710_annotated

Ground truth: 1710_seg

Original image: 1710

monniert commented 3 years ago

Hi, why do you say the results are bad? I see only one mistake (2 paragraphs merged) in the predicted segmentation. Maybe you are referring to the extracted regions, which are much larger than the predicted regions. For that, you need to adjust extractor.ADDITIONAL_MARGIN_RATIO and set it to a value close to 0 when extracting paragraphs (as opposed to thin text lines).
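To illustrate why a margin ratio tuned for thin text lines inflates paragraph crops, here is a minimal sketch. It is not docExtractor's actual code: the helper `expand_box`, the rule "margin = ratio times the shorter side of the box", and the value 0.5 are all assumptions made for illustration only.

```python
def expand_box(box, margin_ratio, image_size):
    """Grow a box (x0, y0, x1, y1) by a margin proportional to its shorter side.

    Hypothetical helper: the margin rule is an assumption, not the
    formula used by docExtractor.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    margin = round(margin_ratio * min(x1 - x0, y1 - y0))
    # Clip the enlarged box to the image bounds.
    return (
        max(0, x0 - margin),
        max(0, y0 - margin),
        min(width, x1 + margin),
        min(height, y1 + margin),
    )


image_size = (1000, 1400)            # (width, height), illustrative values
text_line = (100, 200, 900, 240)     # thin box: 40 px high
paragraph = (100, 200, 900, 800)     # thick box: 600 px high

line_crop = expand_box(text_line, 0.5, image_size)
paragraph_crop = expand_box(paragraph, 0.5, image_size)
paragraph_tight = expand_box(paragraph, 0.0, image_size)
```

Under these assumptions, the same ratio adds a 20 px margin to the text line but a 300 px margin to the paragraph, which is why a ratio close to 0 is the safer choice for paragraph extraction.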

To prevent merged paragraphs, you can additionally predict paragraph borders, as is done for text lines.

seekingdeep commented 3 years ago

you can additionally predict paragraph border as done with text lines

What do you mean? Do you have an example image so I can understand?

There might be a solution for both of the issues that I posted:

example

monniert commented 3 years ago

Sure, here is an example taken directly from the SynDoc dataset:

10004 10004_seg

Yes, diva-hisdb annotations may be a solution, but (i) this kind of annotation is very time-consuming (you can only do a few pages per day) and (ii) I think they are a bit ambiguous (especially between words), so they will be difficult to learn from and to generalize.

seekingdeep commented 3 years ago

1) The polygon-based annotations can be generated from existing rectangle-based boxes, or even be synthesized. It's easy. In my case, I have existing rectangle-based annotations, on which I can run an algorithm to detect the points of the text itself, similar to an ".svg" file, and then connect the letters and words using the points closest to each other. Synthesizing might be even easier, since you can create the lines in ".svg" or ".png" format from the start. With polygon-based annotations, the lines can be accurately separated without a sophisticated text-detection method or seam carving, since they are already accurately segmented and connected.

image

2) If you decide to stick with the (x-height + border) labeling method, then you might want to use 2 colors for the borders of close regions. Even then you might still have difficulties, especially when the border of the first region is too close to, or even intersects, the second region. The borders work well for regions that have clear space between them, but even printed text can be irregular and behave like handwritten text, with lines that are too small, too close together, or even intersecting.

image

3) For my paragraph dataset, you stated that I should also predict the paragraph borders. How can I do that? Some paragraphs are very close to each other.

monniert commented 3 years ago
  1. Yes, if you can get such annotations for free, it may be worth trying. Please keep me updated on the results you get, but I suspect you will still have the problem of overlapping lines.

  2. I am not sure about this solution. Having 2 colors for the same semantic region (here, borders around text) introduces ambiguity, which often makes learning harder for the network: say you rotate your document by 180 degrees, why should it start with one specific color rather than the other? But again, I am curious, so please keep me updated on the results you end up with. If I were you, I would first try 2 alternating colors on the text line regions (without any border, since there is no need for one if this works) rather than 2 colors for the borders (as you first suggested).

  3. Maybe try modeling the borders inside the paragraph regions: for each region, erode it by a few pixels (5?), then fill the difference between the full region and the eroded one with the border color (see this tutorial for info about morphological operations).
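The erosion idea in point 3 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function `add_inner_borders`, the label ids, and the input format (a 2D grid where 0 is background and each paragraph has its own positive id) are hypothetical and not part of docExtractor. Each region is eroded with a square window of side 2 * border_px + 1, and pixels outside the image count as background. On real images, a library routine such as OpenCV's `cv2.erode` or SciPy's `scipy.ndimage.binary_erosion`, applied to one region mask at a time, would be the usual choice.

```python
BACKGROUND, PARAGRAPH, BORDER = 0, 1, 2  # hypothetical label ids


def add_inner_borders(instances, border_px=5):
    """Turn a paragraph instance map into a paragraph/border label map.

    A pixel stays PARAGRAPH only if every pixel within border_px of it
    (square window) belongs to the same region; this equals eroding each
    region separately. The remaining pixels of the region become BORDER.
    """
    h, w = len(instances), len(instances[0])
    labels = [[BACKGROUND] * w for _ in range(h)]
    offsets = range(-border_px, border_px + 1)
    for y in range(h):
        for x in range(w):
            region_id = instances[y][x]
            if region_id == 0:
                continue  # background stays background
            interior = all(
                0 <= y + dy < h
                and 0 <= x + dx < w
                and instances[y + dy][x + dx] == region_id
                for dy in offsets
                for dx in offsets
            )
            labels[y][x] = PARAGRAPH if interior else BORDER
    return labels


# Two paragraphs that touch each other (ids 1 and 2), with a background frame.
instances = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 2, 2, 2, 2, 2, 0],
    [0, 2, 2, 2, 2, 2, 0],
    [0, 2, 2, 2, 2, 2, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
labels = add_inner_borders(instances, border_px=1)
for row in labels:
    print(row)
```

Because each region is eroded on its own, the two touching paragraphs end up separated by a band of border pixels, which would not happen if the merged mask of all paragraphs were eroded at once.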

monniert commented 3 years ago

@seekingdeep closing the issue, please reopen if necessary