The process of GT generation

CrazyCrud commented 3 years ago

First of all, thank you very much for your work! docExtractor extracts text lines very well out of the box.
But I want to fine-tune the model with custom data, and I think my question is related to the process of creating the GT. As you stated in #10, you recommend adding a border around the annotated text lines. Going through the examples on https://enherit.paris.inria.fr/, it seems that borders are not annotated explicitly.

Should borders be annotated (labeled as border), or do you recommend adding them via morphological operations in a post-processing step?

Moreover, I'm not quite sure which labels were used when the model was trained. On https://enherit.paris.inria.fr/ the text lines are labeled as text, but the paper mentions labels like paragraph or table.

What labels should be used when using the default model as a pretrained model?

In my case I work with tabular data. Should text inside the cells therefore be labeled as table?
[Image: Screenshot from 2021-01-28 20-54-21]

Thank you in advance!

monniert commented 3 years ago

Hi @CrazyCrud, thanks for your interest in the project! Here are some answers:

We always filter out the border annotations in our extraction results (https://enherit.paris.inria.fr/ or src/extractor.py), which is why they don't show up.
I would recommend annotating the x-height representation only (wiki), for example using the VIA annotator, and then augmenting the ground truth to generate borders, either directly when converting the VIA JSON to images (I will see what I can do for #13 in the upcoming days) or after conversion, with morphological operations.
The labels used to train the default model are illustration, text and text_border. There are two options to finetune it: (i) you care about extracting all these elements, so you keep the same labels (colors) in your GT and finetuning is straightforward, or (ii) you want to finetune on a different list of labels (completely different or a subset, in your case text and text_border). In that case the final conv1x1 layer would be randomly initialized, but you will still strongly benefit from the rest of the pretrained network. The latter (ii) is the one used to report the finetuned results on the baseline detection benchmarks (cBAD, tables 1 and 2 in the paper).
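
To make the "after conversion, with morphological operations" route concrete, here is a minimal pure-Python sketch of adding a border ring around annotated text pixels by dilation. The integer label ids, the function name `add_text_border` and the `width` parameter are assumptions made for illustration; the real GT images encode labels as colors, and in practice a library such as OpenCV or scipy would do this much faster on full-size pages.

```python
# Hypothetical label ids for illustration only; the actual GT uses colors.
BACKGROUND, TEXT, TEXT_BORDER = 0, 1, 2


def add_text_border(mask, width=1):
    """Return a copy of `mask` in which every background pixel lying within
    `width` pixels (Chebyshev distance) of a TEXT pixel becomes TEXT_BORDER.

    This is a dilation of the text region with a square structuring element
    of side 2 * width + 1, keeping only the newly covered ring as border.
    """
    height, cols = len(mask), len(mask[0])
    out = [row[:] for row in mask]  # leave the input annotation untouched
    for y in range(height):
        for x in range(cols):
            if mask[y][x] != BACKGROUND:
                continue
            # Look at the square neighbourhood, clipped to the image bounds.
            neighbours = (
                mask[ny][nx]
                for ny in range(max(0, y - width), min(height, y + width + 1))
                for nx in range(max(0, x - width), min(cols, x + width + 1))
            )
            if TEXT in neighbours:
                out[y][x] = TEXT_BORDER
    return out


# Toy x-height annotation: one short text line in the middle of the page.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
bordered = add_text_border(mask, width=1)
for row in bordered:
    print("".join(str(value) for value in row))
```

The sketch grows the border outward from the annotated line. An erosion-based variant would instead shrink the text region and relabel the removed pixels as border, which keeps the overall footprint of the annotation unchanged.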

Hope this helps

CrazyCrud commented 3 years ago

@monniert thank you very much for your detailed answer!

> I would recommend annotating the x-height representation only (wiki) for example using VIA annotator [...] after conversion, with morphological operations

This sounds like a reasonable approach, since you explained in the other issue how to use erosion to generate colored borders.

> There are two options to finetune it

So independently of the list of labels, it always seems to be a good idea to use the pretrained model.
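
As I understand option (ii), the idea is to reuse every pretrained weight whose shape still matches and let the final layer start from random initialization. Here is a rough pure-Python sketch of that selection step. The parameter names, the shapes, the `Param` stand-in and the `select_transferable` helper are all made up for illustration and are not taken from the docExtractor code.

```python
from collections import namedtuple

# Stand-in for a tensor: only the shape matters for this sketch.
Param = namedtuple("Param", "shape")


def select_transferable(pretrained, target):
    """Split the target parameters into those that can be copied from the
    pretrained checkpoint (same name and same shape) and those that must
    keep their fresh random initialization."""
    kept, skipped = {}, []
    for name, param in target.items():
        source = pretrained.get(name)
        if source is not None and source.shape == param.shape:
            kept[name] = source
        else:
            skipped.append(name)
    return kept, skipped


# Hypothetical checkpoint with a 4-channel output layer.
pretrained = {
    "encoder.conv1.weight": Param((64, 3, 7, 7)),
    "final_layer.weight": Param((4, 64, 1, 1)),
    "final_layer.bias": Param((4,)),
}
# Hypothetical new model with a different label list (3 output channels).
target = {
    "encoder.conv1.weight": Param((64, 3, 7, 7)),
    "final_layer.weight": Param((3, 64, 1, 1)),
    "final_layer.bias": Param((3,)),
}

kept, skipped = select_transferable(pretrained, target)
print("copied from checkpoint:", sorted(kept))
print("randomly initialized:", skipped)
```

With PyTorch, the `kept` dictionary could then presumably be passed to `model.load_state_dict(kept, strict=False)`, so that only the mismatching final layer keeps its random weights.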

I'll give it a try now and finetune the model. Your answer was very helpful.

monniert / docExtractor

The process of GT generation #12