mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Decreased accuracy in Kraken 5.x compared to same setup in Kraken 4.x with Arabic language model #589

Open bmwmy opened 2 months ago

bmwmy commented 2 months ago

Hi, I ran the same page with the same setup through both Kraken 5.x and Kraken 4.x using the provided Arabic_best.ml, and there are more errors in the latest version (5.x). I think this relates to changes in the segmenter, which has now been modified to allow curly segments. That is probably not good for Arabic (I cannot find the issue #).

mittagessen commented 2 months ago

Can you show me which commands exactly you're running + could you give an XML file + image where this occurs? The segmentation has not changed, only the line extraction before feeding into the recognizer is new. It is disabled by default for pre-5.0 models though so I'm wondering where your issues come from.

bmwmy commented 2 months ago

This is the command:

kraken -i "yarab_deskewed.png" "yarab.txt" segment -bl ocr -m arabic_best.mlmodel

Kraken_Dated_07-09-2022.pdf
Kraken_4.13.20.pdf
kraken_5dev23.pdf

yarab_deskewed (the original file being OCRed)

With every major update of kraken, I have noticed decreased accuracy.
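A claim like this is easier to act on when it is backed by a number. The following is a minimal sketch, not part of kraken, of how the character error rate (CER) of each version's text output could be computed against a manual transcription of the page. It assumes such a ground-truth transcription exists, which the thread does not provide; the function names `levenshtein` and `cer` are hypothetical helpers introduced for illustration only.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # drop ca
                               current[j - 1] + 1,       # drop cb
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of hypothesis measured against reference."""
    # Normalize to NFC so that equivalent Unicode encodings of the same
    # Arabic text are not counted as errors (an assumption about the data).
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)
```

Running `cer(ground_truth, output_4x)` and `cer(ground_truth, output_5x)` on the text produced by each version for the same page would give two directly comparable figures, where a lower value means fewer character errors.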