Open Seegras opened 4 years ago
tesseract doesn't recognize ISO639 code "zho" for chinese
It also does not recognize eng for English...
Tesseract knows nothing about ISO639. It will take any language name you give it, append .traineddata to it, prepend the tessdata directory to the file name, and try to open it.
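To make that lookup concrete, here is a minimal Python sketch of the behaviour described above. It is not Tesseract's actual code; the function name `traineddata_path` is made up for illustration, and the directory is simply the one from the error message in this issue.

```python
import posixpath


def traineddata_path(lang, tessdata_dir):
    """Build the file name as described above: <tessdata_dir>/<lang>.traineddata.

    Hypothetical helper: the language name is used verbatim, with no
    ISO 639 lookup or normalisation of any kind.
    """
    return posixpath.join(tessdata_dir, lang + ".traineddata")


if __name__ == "__main__":
    tessdata = "/usr/share/tesseract-ocr/4.00/tessdata"
    # "zho" is only found if a file with exactly this name exists.
    print(traineddata_path("zho", tessdata))
```

Under this model a language name "works" only when a file with exactly that name is present in the tessdata directory.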
Also, zh/zho does not distinguish between Traditional and Simplified Chinese. For this there are the script codes Hant and Hans, which come from ISO 15924 rather than ISO 639 and are appended to zh/zho (for example zh-Hant and zh-Hans).
ISO 639 does not seem to have anything that can signal vertical vs horizontal writing.
Environment
Current Behavior:
Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/zho.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'zho'
Tesseract couldn't load any languages!
Failed to initialize tesseract (OCR).
Explanation
This comes via vobsub2srt and is expanded from an .idx file which says: id: zh, index: 0. This gets expanded to zho instead of chi, but zho is still a valid ISO 639 code for Chinese: zho is the ISO 639-2/T (terminologic) code, and chi is the ISO 639-2/B (bibliographic) code.
There are actually several languages which have multiple ISO 639 codes, like Welsh (wel and cym), which might suffer from the same problem.
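One way to picture the alias problem is a small lookup that tries both the B and T form of a code against the installed model names. The Python sketch below is only an illustration, not code from Tesseract or vobsub2srt: the names `B_T_PAIRS`, `candidate_codes` and `resolve` are hypothetical, the pair table is a partial subset of the ISO 639-2 B/T pairs, and the sets of installed models are made-up inputs.

```python
# Partial list of ISO 639-2 codes that exist in both a bibliographic (B)
# and a terminologic (T) form. Assumption: extend as needed.
B_T_PAIRS = [
    ("chi", "zho"),  # Chinese
    ("wel", "cym"),  # Welsh
    ("ger", "deu"),  # German
    ("fre", "fra"),  # French
    ("dut", "nld"),  # Dutch
]

# Map each code to its counterpart, in both directions.
ALIASES = {}
for b_code, t_code in B_T_PAIRS:
    ALIASES[b_code] = t_code
    ALIASES[t_code] = b_code


def candidate_codes(code):
    """Return the code itself followed by its B/T counterpart, if any."""
    candidates = [code]
    if code in ALIASES:
        candidates.append(ALIASES[code])
    return candidates


def resolve(code, available):
    """Return the installed model names that match the code or its alias.

    `available` is a collection of model names, i.e. file names without
    the '.traineddata' suffix. A name matches if it equals a candidate
    code or starts with the candidate followed by '_' (e.g. 'chi_sim').
    The first candidate with any match wins; the result is sorted.
    """
    for candidate in candidate_codes(code):
        matches = sorted(
            name for name in available
            if name == candidate or name.startswith(candidate + "_")
        )
        if matches:
            return matches
    return []


if __name__ == "__main__":
    installed = {"eng", "chi_sim", "chi_tra"}
    print(resolve("zho", installed))
    print(resolve("wel", {"cym"}))
```

When several models match, as with Chinese, the caller still has to choose between them or pass them together; Tesseract's `-l` option accepts multiple languages joined with `+`.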
Expected Behavior:
Tesseract recognizes zho and uses one of the chi_*.traineddata files.
Suggested Fix: