manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0

Use of "script" traineddata #462

Closed Jossi2 closed 4 years ago

Jossi2 commented 4 years ago

Strictly speaking, this is probably a Tesseract issue, not a gImageReader issue: I cannot find out what the "script" traineddata files are for that are provided with Tesseract 4 and installed alongside the language files. I use gImageReader mainly for OCR of old German documents printed in blackletter (Fraktur), as Tesseract is the only existing OCR engine giving decent results with this kind of document. But when I choose "Fraktur [script]" from the language menu, the result is dismal, even if combined with "Deutsch [deu]" and the respective dictionary. Dictionary spell checking simply doesn't work. To get the same results as before, I had to manually re-install the old "deu_frak.traineddata" file from the tessdata repository, which is said to work only with Tesseract 3 but in fact works fine with the latest release of gImageReader. So I am asking myself: What's it all about? What are these [script] additions supposed to do?

Shreeshrii commented 4 years ago

See https://github.com/tesseract-ocr/tesstrain/wiki/Training-Fraktur

and

https://github.com/tesseract-ocr/tesstrain/wiki

Jossi2 commented 4 years ago

Thank you for your comment. The linked pages make some points clear. Still I'm at a loss to understand why

  1. script/Fraktur gives a completely unusable result
  2. deu_frak.traineddata works with the current release of gImageReader although this is based on Tesseract 4 and deu_frak.traineddata isn't supported by the LSTM recognizer. Does Tesseract 4 still contain the legacy recognizer as a fallback?

Shreeshrii commented 4 years ago

Tesseract still supports the legacy recognizer.

Usage:
  tesseract --help | --help-extra | --help-psm | --help-oem | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|imagelist|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  --dpi VALUE           Specify DPI for input image.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.

OCR Engine modes:
  0    Legacy engine only.
  1    Neural nets LSTM engine only.
  2    Legacy + LSTM engines.
  3    Default, based on what is available.
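
To make the engine choice explicit rather than relying on the default, here is a minimal sketch in Python that only assembles the command line from the options listed above. The helper name `build_tesseract_cmd` and the file names `page.png`, `page_legacy` and `page_lstm` are hypothetical, and it assumes the matching traineddata files (`deu_frak`, `script/Fraktur`, `deu`) are present in your tessdata directory.

```python
import shlex

# OCR Engine modes as listed in the help output above.
OEM_LEGACY = 0
OEM_LSTM = 1


def build_tesseract_cmd(image, outputbase, langs, oem=None):
    """Assemble a tesseract argument list (hypothetical helper, not part of Tesseract).

    langs is a list such as ["script/Fraktur", "deu"]; entries are joined
    with "+" as expected by the -l option. If oem is None, no --oem option
    is added and Tesseract picks its default mode.
    """
    if not langs:
        raise ValueError("at least one language is required")
    cmd = ["tesseract", image, outputbase, "-l", "+".join(langs)]
    if oem is not None:
        if oem not in (0, 1, 2, 3):
            raise ValueError("oem must be 0, 1, 2 or 3")
        cmd += ["--oem", str(oem)]
    return cmd


# Legacy recognizer with the old Fraktur model.
legacy_cmd = build_tesseract_cmd("page.png", "page_legacy", ["deu_frak"], oem=OEM_LEGACY)

# LSTM recognizer with the script model combined with German.
lstm_cmd = build_tesseract_cmd("page.png", "page_lstm", ["script/Fraktur", "deu"], oem=OEM_LSTM)

print(shlex.join(legacy_cmd))
print(shlex.join(lstm_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once Tesseract is installed. Running both variants on the same page is a simple way to compare the legacy and LSTM output side by side.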

As mentioned by @stweil in https://github.com/tesseract-ocr/tesstrain/wiki/Training-Fraktur

frk supports the German character set, but important characters like for example § are missing and will never be recognized. In addition, some ligatures like ch and ck were trained wrongly and will therefore be recognized as < and >. script/Fraktur supports a larger international character set, but otherwise has the same issues as frk.

Jossi2 commented 4 years ago

Thank you!