tesseract-ocr / tessdoc

Tesseract documentation
https://tesseract-ocr.github.io/tessdoc/

What should be the norm_mode for different languages? #99

Open girikum opened 1 year ago

girikum commented 1 year ago

I see that norm_mode is defined with the following values in https://github.com/tesseract-ocr/tesseract/blob/master/src/training/unicharset_extractor.cpp#L103:

1 - combine graphemes (use for Latin and other simple scripts)
2 - split graphemes (use for Indic/Khmer/Myanmar)
3 - pure unicode (use for Arabic/Hebrew/Thai/Tibetan)
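
To make the question concrete, here is a rough sketch of the kind of mapping I am hoping the documentation could spell out. It only restates the help text quoted above; the names `NORM_MODE_BY_SCRIPT`, `NORM_MODE_DESCRIPTION` and `suggested_norm_mode` are made up for illustration and are not part of Tesseract or tesstrain. Scripts the help text does not name return `None`, which is exactly the gap I am asking about.

```python
# Hypothetical lookup table: restates only the scripts named in the
# unicharset_extractor help text. It is not an official mapping.
NORM_MODE_BY_SCRIPT = {
    "Latin": 1,
    "Indic": 2,
    "Khmer": 2,
    "Myanmar": 2,
    "Arabic": 3,
    "Hebrew": 3,
    "Thai": 3,
    "Tibetan": 3,
}

# Descriptions copied from the same help text.
NORM_MODE_DESCRIPTION = {
    1: "combine graphemes",
    2: "split graphemes",
    3: "pure unicode",
}


def suggested_norm_mode(script):
    """Return the norm_mode the help text suggests, or None if the script is not listed."""
    return NORM_MODE_BY_SCRIPT.get(script)


if __name__ == "__main__":
    for script in ("Latin", "Khmer", "Cyrillic"):
        mode = suggested_norm_mode(script)
        label = NORM_MODE_DESCRIPTION.get(mode, "not covered by the help text")
        print(f"{script}: {mode} ({label})")
```

A table like this per language in tessdata, rather than per script family, is what I would find most useful.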

Can someone clarify in the documentation the exact mapping for all the available languages in the tessdata repos?

I find it confusing that the NORM_MODE defined in the tesstrain Makefile almost never uses the value given above for Latin languages: https://github.com/tesseract-ocr/tesstrain/blob/main/Makefile#L86-L101

According to the Makefile, should norm_mode be 2 even for English?