qurator-spk / eynollah

Document Layout Analysis
Apache License 2.0

Documentation: Should the OCR-D processor run on RGB or binarized images? #39

Closed mikegerber closed 2 years ago

mikegerber commented 3 years ago

Should the OCR-D processor run on an RGB or a binarized image input group?

I think it would be best if the README listed an example, e.g.:

ocrd-eynollah-segment -I <WHICH ONE?> -O SEG-LINE -P xyz abc

kba commented 3 years ago

Should the OCR-D processor run on a RGB

Yes, it should be run on the RGB image. Conventionally we call that file group OCR-D-IMG; in DFG-Viewer conventions, the best match would be MAX, IIRC.

vahidrezanezhad commented 3 years ago

If eynollah is used as a layout segmenter, I would say RGB is preferred.

mikegerber commented 3 years ago

So a valid workflow would be

ocrd-eynollah-segment -I OCR-D-IMG -O OCR-D-SEG-LINE -P xyz abc
ocrd-some-binarization -I OCR-D-SEG-LINE -O OCR-D-IMG-BIN
ocrd-some-ocr -I OCR-D-IMG-BIN -O OCR-D-OCR

Correct?

mikegerber commented 3 years ago

If eynollah is used as a layout segmenter, I would say RGB is preferred.

The question was about the OCR-D processor. This is relevant because, depending on the code, a run with -I OCR-D-IMG-BIN (= binarized image group) could still use the RGB image by retrieving an AlternativeImage. That's why I find it crucial that the README provides an example of correct usage, with the right input file group.

mikegerber commented 3 years ago

(If RGB is preferred, the processor could also issue a warning if binarized (single-channel) input is provided)
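A minimal sketch of such a check, assuming Pillow-style mode strings ("1" for bilevel, "L" for grayscale, "RGB" for color). The function name and logger name are hypothetical and not part of eynollah:

```python
import logging

# Hypothetical logger name, chosen for this sketch only.
LOG = logging.getLogger("eynollah_sketch")

# Pillow-style mode strings: "1" is bilevel, "L" is 8-bit grayscale.
SINGLE_CHANNEL_MODES = {"1", "L"}


def warn_if_single_channel(image_mode: str) -> bool:
    """Log a warning for single-channel input; return True if one was issued."""
    if image_mode in SINGLE_CHANNEL_MODES:
        LOG.warning("Input image has mode %r; RGB input is preferred", image_mode)
        return True
    return False
```

With Pillow, the mode string would come from the `mode` attribute of the opened image. The check only warns and never rejects the input, since single-channel images may still be usable.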

kba commented 3 years ago

So a valid workflow would be

ocrd-eynollah-segment -I OCR-D-IMG -O OCR-D-SEG-LINE -P xyz abc
ocrd-some-binarization -I OCR-D-SEG-LINE -O OCR-D-IMG-BIN
ocrd-some-ocr -I OCR-D-IMG-BIN -O OCR-D-OCR

Correct?

Yes but so would be

[...]
ocrd-some-binarization -I OCR-D-IMG -O OCR-D-IMG-BIN
[...]

because ocrd-some-binarization should filter out binarized images, so it should not make a difference here: it will end up with the image from @imageFilename (if used on page level).
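To illustrate what "filter out binarized images" means, here is a simplified stand-in for that selection logic. It is not the OCR-D core implementation: `pick_image` and the (filename, features) tuples are assumptions made for this sketch, with the features standing in for the comma-separated @comments of an AlternativeImage.

```python
from typing import List, Tuple

# (filename, features) pairs standing in for AlternativeImage entries.
AltImage = Tuple[str, str]


def pick_image(original: str, alternatives: List[AltImage],
               feature_filter: str = "") -> str:
    """Return the latest alternative without any filtered feature, else the original."""
    unwanted = {f for f in feature_filter.split(",") if f}
    # Walk from the most recently added alternative back to the oldest.
    for filename, features in reversed(alternatives):
        present = {f for f in features.split(",") if f}
        if not (present & unwanted):
            return filename
    # No acceptable alternative: fall back to the @imageFilename image.
    return original


alts = [("OCR-D-IMG-BIN/p1.png", "binarized")]
print(pick_image("OCR-D-IMG/p1.tif", alts, feature_filter="binarized"))
print(pick_image("OCR-D-IMG/p1.tif", alts))
```

With the filter set to "binarized", the only alternative is rejected and the original image is returned, which is why the input file group makes no difference for a binarization step in this situation.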

could still use the RGB image by retrieving an AlternativeImage.

We do not do that in eynollah, though: we're passing on the @imageFilename directly, so as long as no processor running before eynollah changes the @imageFilename (which only ocrd-preprocess-image does, IIRC), it will use the RGB image anyway.
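The difference between the two lookups can be sketched with the standard library alone. The sample PAGE-XML, its file paths, and the helper names below are made up for illustration; only the element and attribute names (Page, @imageFilename, AlternativeImage, @filename) come from the PAGE schema.

```python
import xml.etree.ElementTree as ET

PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Made-up PAGE-XML as it might look after a binarization step.
SAMPLE = """\
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="OCR-D-IMG/page_0001.tif" imageWidth="2000" imageHeight="3000">
    <AlternativeImage filename="OCR-D-IMG-BIN/page_0001.bin.png" comments="binarized"/>
  </Page>
</PcGts>
"""


def original_image(page_xml: str) -> str:
    """Return the @imageFilename of the Page element."""
    page = ET.fromstring(page_xml).find("pc:Page", PAGE_NS)
    return page.get("imageFilename")


def alternative_images(page_xml: str) -> list:
    """Return the filenames of all page-level AlternativeImage entries."""
    page = ET.fromstring(page_xml).find("pc:Page", PAGE_NS)
    return [alt.get("filename") for alt in page.findall("pc:AlternativeImage", PAGE_NS)]


print(original_image(SAMPLE))
print(alternative_images(SAMPLE))
```

A binarization step adds an AlternativeImage but leaves @imageFilename untouched, so a processor that reads only @imageFilename keeps seeing the original image regardless of the input file group.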

mikegerber commented 3 years ago

We do not do that in eynollah, though: we're passing on the @imageFilename directly, so as long as no processor running before eynollah changes the @imageFilename (which only ocrd-preprocess-image does, IIRC), it will use the RGB image anyway.

Thanks, I believe that clears that part up: it's OK to do binarization before ocrd-eynollah-segment because it ends up using the RGB image anyway.

mikegerber commented 3 years ago

Sorry if I happen to sound super pedantic here, it's just not easy to see what's happening when we apparently stick in OCR-D-IMG-BIN but the processor does not actually use the images from that group. It was a source of confusion in sbb-textline-detection too.

kba commented 3 years ago

Sorry if I happen to sound super pedantic here, it's just not easy to see what's happening when we apparently stick in OCR-D-IMG-BIN but the processor does not actually use the images from that group. It was a source of confusion in sbb-textline-detection too.

No need to be sorry, it is important to document this so users are clear on which data gets passed where. This behavior (directly using @imageFilename) is also different from all other processors (except sbb-textline-detection ;-)) but I think we can get away with it, because it's very unlikely that eynollah would ever need to be run on presegmented input.

cneud commented 2 years ago

I've added this to the documentation - would this be sufficient @mikegerber?

Use as OCR-D processor

Eynollah ships with a CLI interface to be used as an OCR-D processor. In this case, the source image file group with (preferably) RGB images should be used as input (in fact, the image provided by @imageFilename is passed on directly):

ocrd-eynollah-segment -I OCR-D-IMG -O SEG-LINE -P models

mikegerber commented 2 years ago

I've added this to the documentation - would this be sufficient @mikegerber?

Use as OCR-D processor

Eynollah ships with a CLI interface to be used as OCR-D processor. In this case, the source image file group with (preferably) RGB images should be used as input (in fact, the image provided by @imageFilename is passed on directly):

ocrd-eynollah-segment -I OCR-D-IMG -O SEG-LINE -P models

It's a bit trickier: it's fine to put in -I OCR-D-IMG-BIN, but it will still use the RGB images from the step before.

cneud commented 2 years ago

Ok, how about this then?

Use as OCR-D processor

Eynollah ships with a CLI interface to be used as an OCR-D processor. In this case, the source image file group with (preferably) RGB images should be used as input like this:

ocrd-eynollah-segment -I OCR-D-IMG -O SEG-LINE -P models

In fact, the image provided by @imageFilename in PAGE-XML is passed on directly to Eynollah as a processor, so that e.g.

ocrd-eynollah-segment -I OCR-D-IMG-BIN -O SEG-LINE -P models

will still use the original (RGB) image despite any binarization that may have occurred in previous OCR-D processing steps.

mikegerber commented 2 years ago

Yes this seems to be correct. I checked the source here: https://github.com/qurator-spk/eynollah/blob/main/qurator/eynollah/processor.py#L45-L57

cneud commented 2 years ago

OK I've amended this accordingly in https://github.com/qurator-spk/eynollah/commit/441c8566dda5cc2b37fd92a39236dc595a547298 and will close here once the PR for the README update has been merged.