replace PrintSpace with Border

qurator-spk / sbb_textline_detection

Detect textlines in document images

Apache License 2.0

replace PrintSpace with Border #29

Closed (bertsky closed this 4 years ago)

bertsky commented 4 years ago

I'm not sure if this is intentional, but currently the textline detector uses the PrintSpace element for the outer hull of all detected regions. Shouldn't that be Border instead?

See also:

Here are the two affected places:

https://github.com/qurator-spk/sbb_textline_detection/blob/d36b01591d2328fc03f2956ff98b66e50a5f81f5/qurator/sbb_textline_detector/main.py#L1941

https://github.com/qurator-spk/sbb_textline_detection/blob/d36b01591d2328fc03f2956ff98b66e50a5f81f5/qurator/sbb_textline_detector/ocrd_cli.py#L83-L85
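To make the proposed change concrete, here is a minimal sketch of writing the outer hull as a PAGE-XML `Border` rather than a `PrintSpace`. It is not the code from `main.py` or `ocrd_cli.py`: the helper names (`points_attr`, `add_page_frame`), the image file name, the coordinates and the choice of the 2019-07-15 PAGE namespace are all assumptions for illustration, and only the standard library is used.

```python
import xml.etree.ElementTree as ET

# PAGE namespace; the 2019-07-15 schema version is an assumption here.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def points_attr(polygon):
    """Serialize [(x, y), ...] into the PAGE 'points' format 'x1,y1 x2,y2 ...'."""
    return " ".join(f"{int(x)},{int(y)}" for x, y in polygon)


def add_page_frame(page, polygon, tag="Border"):
    """Attach the page frame to a Page element (hypothetical helper).

    With tag="Border" the hull is written as <Border>; passing
    tag="PrintSpace" reproduces the behaviour questioned in this issue.
    """
    frame = ET.SubElement(page, f"{{{PAGE_NS}}}{tag}")
    coords = ET.SubElement(frame, f"{{{PAGE_NS}}}Coords")
    coords.set("points", points_attr(polygon))
    return frame


# Made-up page and hull, purely for illustration.
page = ET.Element(
    f"{{{PAGE_NS}}}Page",
    imageFilename="0001.tif",
    imageWidth="2000",
    imageHeight="3000",
)
hull = [(120, 80), (1850, 80), (1850, 2900), (120, 2900)]
add_page_frame(page, hull, tag="Border")

print(ET.tostring(page, encoding="unicode"))
```

Both elements carry a `Coords` child with a `points` attribute, so in a sketch like this the switch amounts to changing the element name.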

cneud commented 4 years ago

Thanks, I would prefer to be in line with current @OCR-D conventions here, but let me check this with @vahidrezanezhad @mikegerber (see also PrintSpaceType vs BorderType)

bertsky commented 4 years ago

Note that this also affects the tool's problem statement in the README:

> This tool performs printspace, region and textline detection

How about:

> This tool performs page frame, text region and text line detection (i.e. page cropping, page segmentation and line segmentation)

BTW, this also makes me wonder how well your method can cope with textual noise (like facing pages). Even the (rule-based) DFKI tool has problems with pages like this where the gutter is non-vertical...

cneud commented 4 years ago

Nice example ;) I don't think our current model is very optimized for that yet, but @vahidrezanezhad is already experimenting with new models for e.g. curved lines.

I do remember, however, the 2007 and 2010 IMPACT NCSR implementations being able to deal with such pages.

bertsky commented 4 years ago

> I don't think our current model is very optimized for that yet

You were right. It does detect, sort of, that the right side does not belong to the page, but then uses a bounding box, which forces parts of the noise back into the page frame (image: OCR-D-CROP-SBB_0001).

Maybe this is just a matter of using polygons instead of bboxes?
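The difference can be shown with a small, self-contained sketch. The page outline and the noise point below are made up, and the function names are hypothetical; none of this is the detector's actual code. A frame with a slanted right edge (as with a non-vertical gutter) excludes a point on the facing page when kept as a polygon, while its bounding box pulls that point back in.

```python
def bounding_box(polygon):
    """Return (x_min, y_min, x_max, y_max) of a list of (x, y) points."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_bbox(point, bbox):
    """True if the point lies inside the axis-aligned box (edges included)."""
    x, y = point
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1


def in_polygon(point, polygon):
    """Ray-casting (even-odd) point-in-polygon test."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        xa, ya = polygon[i]
        xb, yb = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (ya > y) != (yb > y):
            x_cross = xa + (y - ya) * (xb - xa) / (yb - ya)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical page frame whose right edge slants inwards towards the bottom.
page_polygon = [(0, 0), (1000, 0), (800, 1500), (0, 1500)]
# Hypothetical pixel of textual noise from the facing page.
noise_point = (950, 1400)

print(in_bbox(noise_point, bounding_box(page_polygon)))
print(in_polygon(noise_point, page_polygon))
```

In this toy case the bounding box contains the noise point and the polygon does not, which matches the suspicion that the bbox step is what reintroduces the noise.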