bertsky opened this issue 3 years ago
Or maybe I should've read the documentation: So training would be via sbb_pixelwise_segmentation, correct?
Those are fine and valid questions. Our current order of priorities is roughly like this: refactoring codebase -> OCR-D integration -> improve documentation -> publication.
Training is indeed done via sbb_pixelwise_segmentation and I believe you can e.g. use page2img if you already have accordingly labeled regions as PAGE-XML. But of course @vahidrezanezhad will be able to explain details much better than I can.
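For illustration, turning PAGE-XML region annotations into the kind of pixelwise label masks such training expects could look roughly like the sketch below. This is only an assumption about the workflow, not the actual page2img code; the schema namespace, region types and class indices are placeholders.

```python
import numpy as np
import cv2
from lxml import etree

# Adjust to the PAGE schema version of your files.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}
# Illustrative class mapping only; choose whatever classes you train on.
LABELS = {"TextRegion": 1, "ImageRegion": 2, "SeparatorRegion": 3}

def page_to_label_mask(page_xml_path):
    """Rasterise PAGE-XML region polygons into a single-channel label mask."""
    page = etree.parse(page_xml_path).find(".//pc:Page", NS)
    mask = np.zeros((int(page.get("imageHeight")), int(page.get("imageWidth"))),
                    dtype=np.uint8)                      # 0 = background
    for tag, class_id in LABELS.items():
        for region in page.findall(f".//pc:{tag}", NS):
            points = region.find("pc:Coords", NS).get("points")
            poly = np.array([[int(x), int(y)]
                             for x, y in (p.split(",") for p in points.split())],
                            dtype=np.int32)
            cv2.fillPoly(mask, [poly], class_id)
    return mask
```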
Or maybe I should've read the documentation: So training would be via sbb_pixelwise_segmentation, correct?
In eynollah, alongside layout, textline and page detection, which are pixelwise segmentation tasks done with https://github.com/qurator-spk/sbb_pixelwise_segmentation, we have an image enhancer which is in fact a pixelwise regressor, so I have to integrate it into https://github.com/qurator-spk/sbb_pixelwise_segmentation as well. The other model we have in eynollah is a scale classifier; the corresponding model trainer (an encoder with two dense layers on top) should be published too.
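To illustrate the scale classifier described above (an encoder with two dense layers on top), a minimal Keras sketch could look like the following. The backbone, layer sizes and number of scale classes are assumptions for illustration, not eynollah's actual model or hyperparameters.

```python
from tensorflow.keras import layers, models, applications

def build_scale_classifier(n_scale_classes=4, input_shape=(448, 448, 3)):
    # Any convolutional encoder would do; ResNet50 is only a placeholder here.
    encoder = applications.ResNet50(include_top=False, weights=None,
                                    input_shape=input_shape, pooling="avg")
    x = layers.Dense(256, activation="relu")(encoder.output)        # dense layer 1
    out = layers.Dense(n_scale_classes, activation="softmax")(x)    # dense layer 2
    return models.Model(encoder.input, out)

model = build_scale_classifier()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```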
It would be awesome if some or all models used throughout eynollah's workflow could be adapted to other domains by providing the tools for training. Ideally this would be complemented with some documentation – but I assume you will publish academically on this approach sooner or later?
Specifically, it would be great if one could integrate detection of additional types of regions (provided there's some suitable structural GT), like:
- vertical text
- tables
- handwriting
- signatures
- stamps
- music scores
- maps
- image subclasses (figures, illustrations, photographs)
Or is there perhaps going to be some way of running incremental annotation here (i.e. masking certain areas of the page, so these can be segmented externally)? (Or is this already possible via API?)
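As a purely hypothetical sketch of what such masking could mean in practice (this is not an existing eynollah API): fill the already-annotated region polygons, e.g. taken from PAGE-XML Coords, before handing the image to an external segmenter.

```python
import numpy as np
import cv2

def mask_annotated_regions(image, region_polygons, fill_value=255):
    """Fill already-annotated region polygons so that an external segmenter
    only sees the remaining, unannotated areas of the page."""
    masked = image.copy()
    for poly in region_polygons:                      # each poly: (N, 2) integer array
        cv2.fillPoly(masked, [np.asarray(poly, dtype=np.int32)],
                     (fill_value,) * (3 if masked.ndim == 3 else 1))
    return masked
```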
Starting with the suggested features, I have to say that vertical textlines are already detectable to some extent by eynollah.
Here at SBB, and in the QURATOR project, we have used, or better said exploited, layout analysis in order to get better OCR with an optimal reading order. With this in mind, we have tried to first detect text regions and all the elements which are needed for OCR.
Detecting the rest of the mentioned elements is of course possible, but we would need great GT for that :)
Thanks @vahidrezanezhad @cneud for getting back to me so quickly!
Those are fine and valid questions. Our current order of priorities is roughly like this: refactoring codebase -> OCR-D integration -> improve documentation -> publication.
Fantastic!
Training is indeed done via sbb_pixelwise_segmentation and I believe you can e.g. use page2img if you already have accordingly labeled regions as PAGE-XML.
Oh, thanks! I am beginning to get an understanding...
When you get to the documentation or publication phase, please don't forget to mention which config_params you used for what kind of data.
In eynollah, alongside layout, textline and page detection, which are pixelwise segmentation tasks done with https://github.com/qurator-spk/sbb_pixelwise_segmentation, ...
Interesting – so you really did stick to dhSegment's idea of using the same paradigm for different sub-tasks. Do these models also get to share some early layer weights, or are they used to initialize one another in the training procedure?
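To make the second question concrete, the mechanism I have in mind is transferring encoder weights between model variants by layer name, as in this toy Keras sketch (not eynollah's actual architecture):

```python
from tensorflow.keras import layers, models

def build_toy_segmenter(n_classes):
    """Toy fully-convolutional model; the named conv layers stand in for a shared encoder."""
    inp = layers.Input((None, None, 3))
    x = layers.Conv2D(32, 3, padding="same", activation="relu", name="enc1")(inp)
    x = layers.Conv2D(64, 3, padding="same", activation="relu", name="enc2")(x)
    out = layers.Conv2D(n_classes, 1, activation="softmax", name=f"head_{n_classes}")(x)
    return models.Model(inp, out)

textline_model = build_toy_segmenter(n_classes=2)   # e.g. trained first
layout_model = build_toy_segmenter(n_classes=6)     # same encoder, different head

# Copy the encoder weights layer by layer; the class-specific head stays freshly initialised.
for layer in layout_model.layers:
    if layer.name.startswith("enc"):
        layer.set_weights(textline_model.get_layer(layer.name).get_weights())
```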
For the pixel-classifier-specific problem of gluing nearby neighbours (especially between columns horizontally during layout segmentation, and between lines vertically during textline segmentation): do you have any additional tricks to incentivise the network towards separation? Like extra classes with larger weights around object contours, or a specialised loss function?
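For reference, one common trick along these lines (whether eynollah uses it is exactly my question) is a per-pixel weight map that emphasises a band around object contours, combined with a weighted cross-entropy:

```python
import numpy as np
import cv2
import tensorflow as tf

def boundary_weight_map(label_mask, band_px=5, border_weight=5.0):
    """Per-pixel weights: `border_weight` within `band_px` of any label boundary, 1 elsewhere."""
    # The morphological gradient is non-zero wherever the label changes.
    edges = cv2.morphologyEx(label_mask.astype(np.uint8), cv2.MORPH_GRADIENT,
                             np.ones((3, 3), np.uint8))
    band = cv2.dilate((edges > 0).astype(np.uint8),
                      np.ones((2 * band_px + 1, 2 * band_px + 1), np.uint8))
    return np.where(band > 0, border_weight, 1.0).astype(np.float32)

def weighted_cce(y_true, y_pred, pixel_weights):
    """Categorical cross-entropy weighted per pixel, e.g. with a boundary weight map."""
    # y_true, y_pred: (batch, H, W, classes); pixel_weights: (batch, H, W)
    cce = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
    return tf.reduce_mean(cce * pixel_weights)
```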
we have an image enhancer which is in fact a pixelwise regressor, so I have to integrate it into https://github.com/qurator-spk/sbb_pixelwise_segmentation as well. The other model we have in eynollah is a scale classifier; the corresponding model trainer (an encoder with two dense layers on top) should be published too.
Fascinating! While I don't think these two components would be candidates for domain adaptation, maybe we could wrap these as separate steps in OCR-D, so one could re-use their intermediate results in combination with other tools and workflows. For example, the enhanced images might help OCR proper, or the scale map could help other segmentation tools. It's probably hard to see the benefit right away, as this is very innovative and one might need to re-train elsewhere. But I would very much welcome the analytical advantage of this sub-tasking...
Starting with the suggested features, I have to say that vertical textlines are already detectable to some extent by eynollah.
Here at SBB, and in the QURATOR project, we have used, or better said exploited, layout analysis in order to get better OCR with an optimal reading order. With this in mind, we have tried to first detect text regions and all the elements which are needed for OCR.
I am not sure I understand this correctly. What I meant by vertical text was text regions that contain text lines which align vertically instead of horizontally. From the visual (non-textual) side, this covers both vertical writing (traditional Chinese / Japanese script) and rotated (90° / 270°) horizontal writing. Identifying this early on would allow doing text line segmentation in some vertical fashion, too.
But your explanation seems more concerned with reading order detection (order of text regions on the page, order of text lines in regions).
Anyway, here's a sample where eynollah does not seem to cope with vertical text yet (except in the one line reading Candide ou l'Optimisme):
Just to clarify: we only had Latin-script documents, so clearly it will not work for Chinese or Arabic ones. I meant that if your documents contain vertical lines (visually, of course), they are detectable to some extent :)
Thanks for clarifying and illustrating! So is there any special representation (during training and/or prediction) for vertical text, either in the textline model or the region model? Or would one merely need to mix in some vertical regions/lines, for example from HJDataset?
For the current tool (and of course we are talking about Latin-script documents), in order to detect vertical lines we first had to train a textline model which is able to detect vertical lines. This is achieved by also feeding the model 90-degree-rotated documents. And then we still needed to calculate the deskewing angle of a region to see whether it is vertical or not.
In the case of Chinese (or Japanese) documents, which are written vertically, of course we would have to train the model with the data you mentioned. But I am not sure how the textline detector would perform if we mixed Latin and Chinese for training, i.e. the quality of a mixed-trained model compared to a model trained only on Latin scripts. We just need to give it a try :)
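A rough sketch of the two ingredients described above, 90-degree rotation augmentation during training and deciding verticality from the estimated deskew angle, could look like this; the actual eynollah implementation may well differ:

```python
import random
import numpy as np

def augment_with_rotation(image, label, p=0.3):
    """Occasionally rotate image and label mask by 90/270 degrees so that the
    textline model also sees (visually) vertical lines during training."""
    if random.random() < p:
        k = random.choice([1, 3])        # 1 -> 90 degrees, 3 -> 270 degrees
        image, label = np.rot90(image, k), np.rot90(label, k)
    return image, label

def is_vertical(deskew_angle_deg, tolerance_deg=45.0):
    """Treat a region as vertical if its estimated deskew angle is closer to 90/270 than to 0/180."""
    angle = abs(deskew_angle_deg) % 180.0
    return abs(angle - 90.0) < tolerance_deg
```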