mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Recognizers with segmentation types set() will be applied to segmentation of type baselines. #252

Closed rohanchn closed 3 years ago

rohanchn commented 3 years ago

Hi,

I trained a segmentation model using kraken's ketos segtrain command, with a set of PAGE XML files that I annotated in eScriptorium as input. The segmentation model performs well, as I can see when I use it in eScriptorium to segment scans it hasn't seen before.

However, when I try to apply it to OCR scans in kraken using the following command:

    for i in *.png; do kraken -i $i ${i%.png}.txt segment -i <seg_model> -bl ocr -m <ocr_model>; done

I get the following warning:

    Loading ANN withregion27_47.mlmodel ✓
    Loading ANN default ✓
    Segmenting ✓
    [13.8289] Recognizers with segmentation types set() will be applied to segmentation of type baselines. This will likely result in severely degraded performace
    WARNING:kraken.rpred:Recognizers with segmentation types set() will be applied to segmentation of type baselines. This will likely result in severely degraded performace
    Processing [####################################] 100%
    Writing recognition results for 112.png ✓

The text recognition is also considerably worse than text recognition with the default segmentation. I can't rely on the default segmentation either, since it does not give me satisfactory line segmentation.

How can I improve this? Could you please help?

rohanchn commented 3 years ago

kraken, version 3.0.0.0b25

rohanchn commented 3 years ago

Not sure about the types. https://github.com/mittagessen/kraken/blob/b99a6d0374dfca3a8ff5e273637e31bb7d6dfd75/kraken/rpred.py#L174-L179
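A note on the wording of the warning: `set()` is how Python prints an empty set, which suggests that no segmentation type was collected from the recognition model(s) at all. The following is a minimal sketch of that kind of check, not the code at the linked lines. The function name `check_seg_types` and the `"bbox"` label are assumptions made for illustration only.

```python
import logging

logger = logging.getLogger("seg_type_sketch")


def check_seg_types(recognizer_types, segmentation_type):
    """Return a warning string when the types do not line up, else None.

    recognizer_types: iterable of segmentation types the recognizers were
    trained on (hypothetical labels such as "bbox" or "baselines").
    segmentation_type: type of the segmentation being recognized.
    """
    seg_types = set(recognizer_types)
    # Warn when the segmentation type is not among the recognizers' types,
    # or when the recognizers disagree among themselves.
    if segmentation_type not in seg_types or len(seg_types) > 1:
        message = (
            f"Recognizers with segmentation types {seg_types} will be "
            f"applied to segmentation of type {segmentation_type}."
        )
        logger.warning(message)
        return message
    return None


# An empty collection of types reproduces the "set()" wording from the log.
print(check_seg_types([], "baselines"))
# A matching type produces no warning.
print(check_seg_types(["baselines"], "baselines"))
```

Under these assumptions, an empty set triggers the warning regardless of how the recognition model was trained, so the message alone does not show which type the model actually has.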

mittagessen commented 3 years ago

You're most likely using an old recognition model that was trained on bounding box data. Such models produce considerably worse results on grayscale, baseline data. Unfortunately, you'll have to train a new model.

rohanchn commented 3 years ago

Right, the recognition model was trained on bounding box data, and the PAGE XML files I am using for ketos segtrain do not contain Unicode text. I was hoping to segment with a custom model and recognize text with my existing recognition model in the same pipeline. I trained my existing recognition model on 4.4k lines from varied sources (historical print). Is there a way to reuse the ground truth (gt) for this model? The recognition is fairly satisfactory, and I also hope to refine it further. It performs well on single-column documents without illustrations etc.

The PAGE XML files that I fed to ketos segtrain do not contain Unicode text. What would be the best way to train a new recognition model for this? Should I prepare PAGE XML with Unicode text in eScriptorium and use those files to train both line segmentation (ketos segtrain) and text recognition (ketos train) separately in kraken?

I really appreciate your help.

mittagessen commented 3 years ago

It doesn't have to be the same data, but a recognition model trained on baseline data from eScriptorium would most likely be the easiest way. The masking of input data and the preprocessing work a bit differently between the two formats, so the network learns differently (and incompatibly).

Someone was working on a converter for bounding box to baseline format but if I remember correctly nothing has come of it yet.

rohanchn commented 3 years ago

I can use my recognition model to OCR the pages I have already annotated in eScriptorium, fill in the transcriptions there, and then train in kraken.

Thank you for clarifying.

rohanchn commented 3 years ago

So, I had a few PAGE XML files (18) from eScriptorium with transcriptions aligned to baselines, and I tried to train a recognition model on them. That is not a lot of lines, but I wanted to test.
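Before training on such files, it can help to confirm that each text line really carries both a baseline and a non-empty transcription. The sketch below uses only the Python standard library and assumes the usual PAGE layout of `TextLine` elements with a `Baseline` child and a `TextEquiv/Unicode` child. The helper names and the trimmed sample document (which omits `Coords`) are illustrative, not taken from the thread.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so any PAGE schema version is accepted."""
    return tag.rsplit("}", 1)[-1]


def summarize_page_xml(xml_text):
    """Count text lines and how many carry a baseline and a transcription."""
    root = ET.fromstring(xml_text)
    counts = {"lines": 0, "with_baseline": 0, "with_text": 0}
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        counts["lines"] += 1
        # Only direct children count, so word-level TextEquiv is ignored.
        children = {local_name(child.tag): child for child in elem}
        baseline = children.get("Baseline")
        if baseline is not None and baseline.get("points"):
            counts["with_baseline"] += 1
        text_equiv = children.get("TextEquiv")
        if text_equiv is not None:
            unicode_elems = [c for c in text_equiv if local_name(c.tag) == "Unicode"]
            if unicode_elems and (unicode_elems[0].text or "").strip():
                counts["with_text"] += 1
    return counts


# Trimmed, hypothetical sample: the second line has an empty transcription.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="112.png">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Baseline points="10,20 200,22"/>
        <TextEquiv><Unicode>example line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2">
        <Baseline points="10,60 200,61"/>
        <TextEquiv><Unicode></Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

print(summarize_page_xml(SAMPLE))
```

To check real files, the same function could be called on the text of each exported file; lines counted in `lines` but not in `with_text` have no transcription to train on.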

When I used this new baseline-based recognition model alongside the segmentation model, I still got the aforementioned warning. My command is this:

    for i in *.png; do kraken -i $i ${i%.png}.txt segment -bl -i wr48_33.mlmodel ocr -m 1new_best.mlmodel; done
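The same batch can also be driven from Python, which sidesteps shell quoting problems with file names that contain spaces. This is only a sketch: it reuses exactly the kraken subcommands and flags from the command above, and the function names `build_kraken_command` and `run_batch` are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_kraken_command(image, seg_model, rec_model):
    """Mirror the shell loop: baseline segmentation, then recognition."""
    image = Path(image)
    output = image.with_suffix(".txt")
    return [
        "kraken", "-i", str(image), str(output),
        "segment", "-bl", "-i", seg_model,
        "ocr", "-m", rec_model,
    ]


def run_batch(directory, seg_model, rec_model, dry_run=True):
    """Build (and optionally run) one kraken call per PNG in `directory`."""
    commands = []
    for image in sorted(Path(directory).glob("*.png")):
        command = build_kraken_command(image, seg_model, rec_model)
        commands.append(command)
        if not dry_run:
            # check=True stops the batch at the first failing page.
            subprocess.run(command, check=True)
    return commands


# Show the command for one page without running kraken.
print(build_kraken_command("112.png", "wr48_33.mlmodel", "1new_best.mlmodel"))
```

With `dry_run=True` (the default here) nothing is executed, so the generated commands can be inspected first.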

I used this to train the text recognizer in kraken:

    ketos train -f page -d cuda:0 -o 1new *.xml

I think I am still missing something?