ulb-sachsen-anhalt / ulb-zeitungsprojekt-hp1

Training data from "Hauptphase I" of project "Digitalisierung historischer deutscher Zeitungen"

binarized or raw grayscale? #4

Open bertsky opened 1 year ago

bertsky commented 1 year ago

I was wondering whether the provided images underwent some kind of preprocessing (denoising / normalization). Then I stumbled over this step in the training script:

https://github.com/ulb-sachsen-anhalt/ulb-zeitungsprojekt-hp1/blob/677f4ec92d59b569abcaf70944df965ccd30a0f4/00-prepare.sh#L125-L126

Does that mean that the original scans have actually been binarized first? (And I guess the wider question is which kind of images your provided model can be expected to work best with.)

(Even if not:) Have you run any experiments comparing raw vs. binarized training?
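(For reference, whether a given image in the dataset is effectively binary or still greyscale can be checked quickly by counting its unique pixel values, e.g. with ImageMagick; the filename below is just a placeholder.)

```sh
# A binarized image has exactly 2 unique values (0 and 255);
# a raw greyscale scan typically has dozens to hundreds.
identify -format "%f: %[colorspace], %k unique values\n" sample_page.tif
```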

M3ssman commented 1 year ago

No, sorry. It's an artifact of several attempts to also re-use OCR-PAGE output produced by the OCR-D workflows of late 2019, which at some stage also produced binarized region images in this format.

The motivation was to get more training data that was already dewarped and deskewed, because our real newspaper material often has issues in that regard.

But these early attempts didn't produce much usable output, and it turned out to be far easier to cut up a complete page and throw away the roughly one third (varying from page to page) that seemed too bad.

Back then I did some experiments comparing raw (i.e. greyscale) and binarized input, though not at large scale, i.e. not on the whole ZD1_25 corpus. If I recall correctly, greyscale-trained models performed slightly better than binarized-trained ones, which seemed quite reasonable to me, since they got the same kind of input for training as later for recognition.

Furthermore, I guess both Tesseract and lstmtraining use Leptonica for binarization, whether one likes it or not, so it also seemed reasonable to let both components apply their own binarization the way they do.

bertsky commented 1 year ago

Understood – thanks for the clarification!

If I recall correctly, greyscale-trained models performed slightly better than binarized-trained ones, which seemed quite reasonable to me, since they got the same kind of input for training as later for recognition.

Oh, but surely when training on binary you need to predict on binary, and when training on greyscale you should predict on greyscale – so I don't understand that argument.

Furthermore, I guess both Tesseract and lstmtraining use Leptonica for binarization, whether one likes it or not, so it also seemed reasonable to let both components apply their own binarization the way they do.

Not true though. Tesseract does not use binarization for recognition at all (at least not for LSTM models; only for legacy models and for segmentation). So binary with Tesseract always means applying binarization externally.
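For concreteness, a minimal sketch of that external route with ImageMagick and the tesseract CLI (filenames, the fixed 60% threshold, and the frk model are only placeholders; a dedicated OCR binarizer would normally be preferable to a plain global threshold):

```sh
# Binarize externally, then recognize the binary image with an LSTM model.
convert page_0001.tif -colorspace Gray -threshold 60% page_0001.bin.tif
tesseract page_0001.bin.tif page_0001 --psm 4 -l frk
```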