monniert / docExtractor

(ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper
https://www.tmonnier.com/docExtractor
MIT License

The process of GT generation #12

CrazyCrud closed this issue 3 years ago

CrazyCrud commented 3 years ago

First of all, thank you very much for your work! docExtractor extracts text lines very well out of the box.
But I want to fine-tune the model with custom data, and I think my question is related to the process of creating the ground truth (GT). As you stated in #10, you recommend adding a border around the annotated text lines. However, when I went through the examples on https://enherit.paris.inria.fr/, it seemed that borders are not annotated explicitly.

Moreover, I'm not quite sure which labels were used when the model was trained. On https://enherit.paris.inria.fr/ the text lines are labeled as text, but the paper mentions labels like paragraph or table.

In my case, I work with tabular data. Should the text inside the cells therefore be labeled as table?
[Image: screenshot from 2021-01-28 20-54-21]

Thank you in advance!

monniert commented 3 years ago

Hi @CrazyCrud, thanks for your interest in the project! Here are some answers:

Hope this helps

CrazyCrud commented 3 years ago

@monniert thank you very much for your detailed answer!

I would recommend annotating the x-height representation only (wiki) for example using VIA annotator [...] after conversion, with morphological operations

This sounds like a reasonable approach, since you explained in the other issue how to use erosion to generate colored borders.
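For illustration, here is a minimal sketch of how a border label could be derived from an annotated text-line mask with erosion. It is only one possible reading of the quoted advice, not docExtractor's actual conversion code: the function names `erode` and `add_border_label`, the label values (0 = background, 1 = text, 2 = border) and the border width are all assumptions, and the erosion is written with plain NumPy rather than an image-processing library.

```python
import numpy as np


def erode(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary erosion with a (2*radius+1) x (2*radius+1) square element.

    A pixel stays True only if every pixel in its square neighbourhood is True.
    Pixels outside the image are treated as False.
    """
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), radius, mode="constant", constant_values=False)
    out = np.ones((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out &= padded[dy:dy + h, dx:dx + w]
    return out


def add_border_label(line_mask, radius=1, text_label=1, border_label=2):
    """Split an annotated region into an eroded text core and a border ring.

    The label values are hypothetical; adapt them to the label/color
    convention expected by the training code.
    """
    line_mask = np.asarray(line_mask, dtype=bool)
    core = erode(line_mask, radius)
    labels = np.zeros(line_mask.shape, dtype=np.uint8)
    labels[line_mask] = border_label  # whole annotated region first
    labels[core] = text_label         # then overwrite the eroded core
    return labels


# Toy example: one rectangular "text line" annotation on a small page.
mask = np.zeros((7, 9), dtype=bool)
mask[1:6, 1:8] = True
labels = add_border_label(mask, radius=1)
print(labels)
```

With this approach only the text regions need to be annotated by hand; the border class is generated afterwards, and `radius` controls how thick the border ring is.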

There are two options to finetune it

So, regardless of the list of labels, it always seems to be a good idea to use the pretrained model.

I'll give it a try now and fine-tune the model. Your answer was very helpful.