I see that norm_mode is documented with the following values in https://github.com/tesseract-ocr/tesseract/blob/master/src/training/unicharset_extractor.cpp#L103:
1 - combine graphemes (use for Latin and other simple scripts)
2 - split graphemes (use for Indic/Khmer/Myanmar)
3 - pure unicode (use for Arabic/Hebrew/Thai/Tibetan)
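To make the question concrete, here is how I read that comment, written as a small Python lookup. This is only a sketch of my own understanding: the names `NORM_MODE_BY_SCRIPT_GROUP` and `norm_mode_for` are hypothetical and not part of Tesseract or tesstrain, and the table covers only the script groups the comment names explicitly.

```python
from typing import Optional

# Script groups exactly as named in the unicharset_extractor.cpp comment.
# Hypothetical table for illustration; not an official Tesseract mapping.
NORM_MODE_BY_SCRIPT_GROUP = {
    1: ("Latin",),                               # combine graphemes
    2: ("Indic", "Khmer", "Myanmar"),            # split graphemes
    3: ("Arabic", "Hebrew", "Thai", "Tibetan"),  # pure unicode
}


def norm_mode_for(script_group: str) -> Optional[int]:
    """Return the norm_mode the comment suggests, or None if it is not listed."""
    for mode, groups in NORM_MODE_BY_SCRIPT_GROUP.items():
        if script_group in groups:
            return mode
    return None  # the comment does not say what to use for this script


print(norm_mode_for("Latin"))
print(norm_mode_for("Cyrillic"))
```

Anything the comment does not list (Cyrillic, for example) comes back as `None`, which is exactly the gap I would like the documentation to fill.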
Can someone clarify in the documentation the exact mapping for all the available languages in the tessdata repos?
It is pretty confusing to me that the NORM_MODE defined in the tesstrain Makefile almost never uses the value listed for Latin languages: https://github.com/tesseract-ocr/tesstrain/blob/main/Makefile#L86-L101
Should norm_mode be 2 even for English according to the Makefile?