tesseract doesn't recognize ISO639 code "zho" for chinese

Seegras commented 4 years ago

Environment

Tesseract Version: 4.1.1-2
Commit Number: Probably 7274cfa
Platform: Linux 5.6.14 #1 SMP PREEMPT Mon May 25 08:48:09 CEST 2020 x86_64 GNU/Linux

Current Behavior:

Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/zho.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'zho'
Tesseract couldn't load any languages!
Failed to initialize tesseract (OCR).

Explanation

This comes via vobsub2srt, which takes the language from an .idx file that says: id: zh, index: 0. The zh gets expanded to zho instead of chi, but zho is still a valid ISO 639 code for Chinese.

There are actually several languages which have multiple ISO 639 codes, like Welsh (wel and cym), which might suffer from the same problem.
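To illustrate the duplicate-code point: ISO 639-2 assigns some languages both a bibliographic (B) and a terminologic (T) code. The sketch below lists a few of those pairs and returns every code equivalent to a given one. The table is a deliberately small subset and `equivalent_codes` is a hypothetical helper name, not something Tesseract or vobsub2srt provides.

```python
# A few ISO 639-2 bibliographic (B) -> terminologic (T) code pairs.
# This is only a subset chosen for illustration.
B_TO_T = {
    "chi": "zho",  # Chinese
    "wel": "cym",  # Welsh
    "ger": "deu",  # German
    "fre": "fra",  # French
    "dut": "nld",  # Dutch
    "cze": "ces",  # Czech
}
T_TO_B = {t: b for b, t in B_TO_T.items()}


def equivalent_codes(code):
    """Return the set of codes that name the same language as `code`.

    Hypothetical helper: codes missing from the table map only to themselves.
    """
    code = code.lower()
    codes = {code}
    if code in B_TO_T:
        codes.add(B_TO_T[code])
    if code in T_TO_B:
        codes.add(T_TO_B[code])
    return codes
```

A caller such as a subtitle tool could try each returned code in turn until one matches an installed language file.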

Expected Behavior:

Tesseract recognizes zho and uses one of the chi_*.traineddata files.

Suggested Fix:

amitdo commented 4 years ago

tesseract doesn't recognize ISO639 code "zho" for chinese

It also does not recognize eng for English...

Tesseract knows nothing about ISO 639. It takes whatever language name you give it, appends .traineddata to it, prepends the tessdata directory to the filename, and tries to open it.
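The lookup described above can be sketched in a few lines. This is only a model of the behaviour as described in this comment, not Tesseract's actual code, and the function names `traineddata_path` and `can_load` are made up for the illustration.

```python
from pathlib import Path


def traineddata_path(lang, tessdata_dir):
    """Build <tessdata_dir>/<lang>.traineddata with no interpretation of `lang`."""
    return Path(tessdata_dir) / f"{lang}.traineddata"


def can_load(lang, tessdata_dir):
    """True if a file with exactly that name exists in the tessdata directory."""
    return traineddata_path(lang, tessdata_dir).is_file()
```

Under this model, `zho` fails simply because no file named zho.traineddata is installed, while `eng` works only because eng.traineddata happens to exist.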

amitdo commented 4 years ago

Also, zh/zho does not distinguish between Traditional and Simplified Chinese. For this there are the script codes Hant and Hans, which are appended to zh/zho (for example zh-Hans); they come from ISO 15924 rather than ISO 639.

ISO 639 does not seem to have anything that can signal vertical vs horizontal writing.
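A caller that wants to pick a Chinese model therefore has to supply the script and the writing direction itself. The sketch below assumes the model names chi_sim, chi_tra, chi_sim_vert and chi_tra_vert used in the tessdata repositories; `chinese_model` and its `vertical` parameter are hypothetical and exist only for this example.

```python
# Assumed mapping from script code to Tesseract model name.
SCRIPT_TO_MODEL = {"hans": "chi_sim", "hant": "chi_tra"}


def chinese_model(tag, vertical=False):
    """Map a tag such as 'zh-Hans' or 'zho-Hant' to a traineddata name.

    Raises ValueError if the tag is not Chinese or carries no script code,
    because a bare zh/zho/chi cannot select between the two scripts.
    """
    parts = tag.replace("_", "-").lower().split("-")
    if parts[0] not in ("zh", "zho", "chi"):
        raise ValueError(f"not a Chinese language tag: {tag!r}")
    script = next((p for p in parts[1:] if p in SCRIPT_TO_MODEL), None)
    if script is None:
        raise ValueError(f"no Hans/Hant script code in {tag!r}")
    name = SCRIPT_TO_MODEL[script]
    # Orientation is not encoded in the tag, so it is passed separately.
    return name + "_vert" if vertical else name
```

Raising an error for a bare zh is a design choice of this sketch; a tool could equally fall back to a configured default.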
