qurator-spk / sbb_textline_detection

Detect textlines in document images
Apache License 2.0

TextLine coordinates too coarse #33

Closed bertsky closed 4 years ago

bertsky commented 4 years ago

Would it be possible to get good polygonal outlines from the text line segmentation instead of coarse bounding boxes?

There is a stark contrast between the precise contours of the text regions (which never overlap) and the coarse rectangles of the text lines inside them (which often protrude beyond their parent region and overlap with adjacent lines).

This makes it risky to apply line-level dewarping afterwards, and requires an OCR engine that can cope with intruders in the line image. In the example given in #29, I get these line images from ocrd-cis-ocropy-dewarp:

[Images: dewarped line images OCR-D-IMG-DEW-SBB_0001_r21_l24 to OCR-D-IMG-DEW-SBB_0001_r21_l34]
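The two problems described in this comment (line rectangles protruding beyond the parent region, and rectangles of adjacent lines overlapping) can be checked mechanically. The following is only a minimal sketch in plain Python, not code from this repository: the function names and the sample coordinates are made up for illustration, rectangles are assumed to be axis-aligned `(x_min, y_min, x_max, y_max)` tuples, and only the rectangle corners are tested against the region polygon.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


def rects_overlap(a: Rect, b: Rect) -> bool:
    """True if two axis-aligned rectangles share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def point_in_polygon(p: Point, polygon: List[Point]) -> bool:
    """Ray-casting test; points exactly on the boundary get no special handling."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from p to the right cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rect_corners(r: Rect) -> List[Point]:
    x0, y0, x1, y1 = r
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]


def line_leaves_region(line: Rect, region: List[Point]) -> bool:
    """True if any corner of the line rectangle lies outside the region polygon."""
    return any(not point_in_polygon(c, region) for c in rect_corners(line))


# Hypothetical sample data: one text region and three line rectangles.
region = [(10, 10), (200, 10), (200, 100), (10, 100)]
line_a = (5, 20, 190, 45)   # sticks out on the left
line_b = (15, 40, 190, 65)  # overlaps line_a vertically
line_c = (15, 70, 190, 90)  # well-behaved

print(line_leaves_region(line_a, region))  # True
print(rects_overlap(line_a, line_b))       # True
print(line_leaves_region(line_c, region))  # False
```

A corner test like this is cheap but incomplete: a rectangle can cross a concave region boundary while all four corners stay inside, so a full polygon intersection would be needed for a strict check.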

vahidrezanezhad commented 4 years ago

@bertsky We can make the text lines tighter, but this has its own disadvantages. By the way, we will publish a new tool which outputs contours for text lines instead of rectangles. However, that method costs more processing time!

bertsky commented 4 years ago

@vahidrezanezhad

> We can make the text lines tighter, but this has its own disadvantages. However, that method costs more processing time!

Then why not make that behaviour optional (with an ocrd-tool.json parameter), so the user can decide what is needed (precision or performance) for her workflow?
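As a rough illustration of what such a switch could look like on the output side, here is a minimal Python sketch. Everything in it is an assumption: the parameter name `precise_outlines` and both function names are hypothetical and do not exist in this tool, and the sketch only shows how the choice would affect the coordinates that are written, not the (more expensive) contour extraction itself. The output string uses the `x,y x,y ...` form of the PAGE-XML `Coords/@points` attribute.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def bounding_box(contour: List[Point]) -> List[Point]:
    """Axis-aligned bounding rectangle of a contour, as four corner points."""
    xs = [x for x, _ in contour]
    ys = [y for _, y in contour]
    return [
        (min(xs), min(ys)),
        (max(xs), min(ys)),
        (max(xs), max(ys)),
        (min(xs), max(ys)),
    ]


def textline_points(contour: List[Point], precise_outlines: bool) -> str:
    """Serialize a text line outline as a PAGE-XML style points string.

    precise_outlines is a hypothetical parameter: True keeps the polygon,
    False falls back to the coarse bounding rectangle.
    """
    outline = contour if precise_outlines else bounding_box(contour)
    return " ".join(f"{x},{y}" for x, y in outline)


# Hypothetical contour of a slightly warped text line.
contour = [(10, 22), (60, 20), (120, 25), (118, 40), (60, 36), (12, 38)]

print(textline_points(contour, precise_outlines=True))
# 10,22 60,20 120,25 118,40 60,36 12,38
print(textline_points(contour, precise_outlines=False))
# 10,20 120,20 120,40 10,40
```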

> By the way, we will publish a new tool which outputs contours for text lines instead of rectangles.

Where?

And why did you close the issue already?

vahidrezanezhad commented 4 years ago

Dear @bertsky, first of all, you can find the tool that outputs text lines as contours here: https://github.com/vahidrezanezhad/newspapers_regions_and_reading_order_curved_lines

It is not integrated as an option into the current model because it will be a separate tool, one that can also give the reading order of text regions. The other reason is that it is still under development. If you use this tool (of course I can share the models with you :) ), you will see that I write the text line contours on the deskewed image and not on the original image; however, based on internal decisions at SBB, we decided to write the results on the original image again.
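For readers wondering what writing results "on the original image again" involves: contours found on a deskewed image have to be rotated back before they match the original pixels. The sketch below is only an illustration under stated assumptions, not the method used by either tool. It assumes the deskewing was a pure rotation about a known centre with no cropping or canvas resizing, and the function names, the sample angle and the sample centre are all hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotate_points(points: List[Point], angle_deg: float, center: Point) -> List[Point]:
    """Rotate points by angle_deg around center.

    Uses the standard rotation matrix; in image coordinates (y pointing
    down) a positive angle therefore appears clockwise.
    """
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    cx, cy = center
    rotated = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        rotated.append((cx + dx * cos_a - dy * sin_a,
                        cy + dx * sin_a + dy * cos_a))
    return rotated


def to_original(points_deskewed: List[Point], deskew_angle_deg: float,
                center: Point) -> List[Point]:
    """Undo a deskew rotation by applying the inverse (negated) angle."""
    return rotate_points(points_deskewed, -deskew_angle_deg, center)


# Hypothetical example: a contour, a 2.5 degree deskew, and the image centre.
center = (500.0, 700.0)
contour_original = [(100.0, 200.0), (400.0, 205.0), (400.0, 230.0), (100.0, 225.0)]
contour_deskewed = rotate_points(contour_original, 2.5, center)
restored = to_original(contour_deskewed, 2.5, center)

# The round trip reproduces the original points up to floating-point error.
print(all(math.isclose(a, b, abs_tol=1e-9)
          for p, q in zip(contour_original, restored)
          for a, b in zip(p, q)))
```

If the deskewing step also enlarges or crops the canvas, an additional translation would be needed, which this sketch does not cover.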

bertsky commented 4 years ago

@vahidrezanezhad understood – I'll try to follow. Thanks for clarifying!