Closed Jossi2 closed 4 years ago
Thank you for your comment. The linked pages make some points clear. Still, I am at a loss to understand why
Tesseract continues to support the legacy recognizer.
Usage:
  tesseract --help | --help-extra | --help-psm | --help-oem | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|imagelist|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  --dpi VALUE           Specify DPI for input image.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.

OCR Engine modes:
  0    Legacy engine only.
  1    Neural nets LSTM engine only.
  2    Legacy + LSTM engines.
  3    Default, based on what is available.
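
To make the engine-mode choice concrete, here is a minimal Python sketch that assembles a command line from the options listed above. The helper names (`build_tesseract_cmd`, `run_ocr`) and the file names `page.png` and `out` are hypothetical and chosen only for illustration. The flags themselves (`-l`, `--oem`, `--psm`, `--tessdata-dir`) are taken from the help text. Writing the script data as `script/Fraktur` is an assumption about how the files are laid out under the tessdata directory.

```python
import shutil
import subprocess

# Engine modes exactly as listed in the help text above.
OEM_LEGACY, OEM_LSTM, OEM_COMBINED, OEM_DEFAULT = 0, 1, 2, 3


def build_tesseract_cmd(image, outputbase, langs, oem=OEM_DEFAULT,
                        psm=None, tessdata_dir=None):
    """Return the argument list for one tesseract call (hypothetical helper)."""
    if oem not in (OEM_LEGACY, OEM_LSTM, OEM_COMBINED, OEM_DEFAULT):
        raise ValueError("oem must be 0, 1, 2 or 3")
    if not langs:
        raise ValueError("at least one language is required")
    cmd = ["tesseract", image, outputbase]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    # Several languages are joined with '+', as in -l LANG[+LANG].
    cmd += ["-l", "+".join(langs), "--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def run_ocr(cmd):
    """Run the command if tesseract is installed; otherwise return None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


# Old Fraktur model with the legacy engine only (file names are placeholders).
legacy_cmd = build_tesseract_cmd("page.png", "out", ["deu_frak"], oem=OEM_LEGACY)

# Script model combined with German, LSTM engine only.
# "script/Fraktur" assumes the script files sit in a script/ subdirectory.
lstm_cmd = build_tesseract_cmd("page.png", "out", ["script/Fraktur", "deu"],
                               oem=OEM_LSTM)

print(" ".join(legacy_cmd))
print(" ".join(lstm_cmd))
```

As far as I understand it, mode 0 only works if the selected traineddata file actually contains a legacy model, and mode 1 only if it contains an LSTM model, so which combination succeeds depends on which traineddata files are installed.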
As mentioned by @stweil in https://github.com/tesseract-ocr/tesstrain/wiki/Training-Fraktur:
frk supports the German character set, but important characters like for example § are missing and will never be recognized. In addition, some ligatures like ch and ck were trained wrongly and will therefore be recognized as < and >. script/Fraktur supports a larger international character set, but otherwise has the same issues as frk.
Thank you!
Strictly speaking, this is probably a Tesseract issue, not a gImageReader issue: I cannot find out what purpose the "script" traineddata files serve that are provided with Tesseract 4 and installed alongside the language files. I use gImageReader mainly for OCR of old German documents printed in blackletter (Fraktur), as Tesseract is the only existing OCR engine that gives decent results with this kind of document. But when I choose "Fraktur [script]" from the language menu, the result is dismal, even when combined with "Deutsch [deu]" and the respective dictionary. Dictionary spell checking simply doesn't work. To get the same results as before, I had to manually reinstall the old "deu_frak.traineddata" file from the tessdata repository, which is said to work only with Tesseract 3 but in fact works fine with the latest release of gImageReader. So I am asking myself: what is this all about? What are these [script] additions supposed to do?