Closed CrazyCrud closed 3 years ago
Hi @CrazyCrud thanks for the interest in the project! Here are some answers:
[...] `src/extractor.py`) so that's why they don't show up. [...] `illustration`, `text` and `text_border`. There are two options to finetune it: (i) you care about extracting all these elements, so you keep the same labels (colors) in your GT and finetuning is straightforward, or (ii) you want to finetune on a different list of labels (completely different or a subset, in your case `text` and `text_border`); in that case the final conv1x1 layer would be randomly initialized, but you will still strongly benefit from the rest of the pretrained network. The latter (ii) is the one performed to report the finetuned results on the baseline detection benchmarks (cBADs, table 1 and 2 in the paper).
Hope this helps
@monniert thank you very much for your detailed answer!
> I would recommend annotating the x-height representation only (wiki) for example using VIA annotator [...] after conversion, with morphological operations
This sounds like a reasonable approach as you explained how to use erosion to generate colored borders in the other issue.
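For reference, here is a minimal sketch of how erosion can split an annotated text-line mask into an inner text region and a surrounding border ring. It is only an illustration in plain Python, not the conversion code from the repository: the function names `erode` and `split_text_and_border`, the 3x3 neighbourhood, and the toy mask are all assumptions made for this example.

```python
def erode(mask):
    """3x3 binary erosion on a list-of-lists mask.

    A pixel stays set only if it and its 8 neighbours are all set.
    Pixels outside the image are treated as background.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(all(
                0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ))
    return out


def split_text_and_border(mask):
    """Return (inner, border): the eroded mask and the ring removed by erosion."""
    inner = erode(mask)
    border = [
        [int(m and not i) for m, i in zip(mask_row, inner_row)]
        for mask_row, inner_row in zip(mask, inner)
    ]
    return inner, border


# Toy annotation: one text line, 3 px high and 5 px wide.
mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
inner, border = split_text_and_border(mask)
for row in border:
    print(row)
```

The inner region would then get the text color and the ring the border color. On real images a library routine such as `scipy.ndimage.binary_erosion` or `cv2.erode` would be the more practical choice, with a structuring element sized to the line height.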
> There are two options to finetune it
So independently of the list of labels, it always seems to be a good idea to use the pretrained model.
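As I understand option (ii), the idea is to reuse every pretrained weight whose shape still fits and let only the final layer start from random values. A rough sketch of that selection step follows. The layer names, the class counts (4 and 3) and the `Param` stand-in are hypothetical and chosen only so the example runs without any dependencies; I have not checked how the repository itself handles this.

```python
from collections import namedtuple

# Stand-in for a tensor: only the shape matters for this sketch.
Param = namedtuple("Param", ["shape"])


def split_by_shape(pretrained, model_state):
    """Keep pretrained entries whose name and shape match the new model.

    Returns (kept, skipped): a dict of reusable weights and a list of
    names that must keep their random initialization.
    """
    kept, skipped = {}, []
    for name, value in pretrained.items():
        target = model_state.get(name)
        if target is not None and tuple(target.shape) == tuple(value.shape):
            kept[name] = value
        else:
            skipped.append(name)
    return kept, skipped


# Hypothetical example: the pretrained head predicts 4 classes, the new one 3.
pretrained = {
    "encoder.conv1.weight": Param((64, 3, 7, 7)),
    "final_conv.weight": Param((4, 64, 1, 1)),
    "final_conv.bias": Param((4,)),
}
new_model = {
    "encoder.conv1.weight": Param((64, 3, 7, 7)),
    "final_conv.weight": Param((3, 64, 1, 1)),
    "final_conv.bias": Param((3,)),
}

kept, skipped = split_by_shape(pretrained, new_model)
print("reused:", sorted(kept))
print("randomly initialized:", skipped)
```

With PyTorch, the same function should work on a loaded checkpoint and `model.state_dict()`, since tensors expose `.shape`; the result can then be passed to `model.load_state_dict(kept, strict=False)` so the missing final-layer keys are tolerated.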
I'll give it a try now and finetune the model. Your answer was very helpful.
First of all, thank you very much for your work! docExtractor extracts text lines very well out of the box.
But I want to fine-tune the model with custom data, and I think my question is related to the process of creating the GT. As you stated in #10, you recommend adding some border around the annotated text lines. As I went through the examples on https://enherit.paris.inria.fr/, it seems that borders are not annotated explicitly.
Moreover, I'm not quite sure which labels were used when the model was trained. On https://enherit.paris.inria.fr/ the text lines are labeled as text, but the paper states labels like paragraph or table.
In my case I work with tabular data. Should text inside the cells therefore be labeled as table?
Thank you in advance!