Closed yurivict closed 6 months ago
Most language codes are ISO 639-2 codes. Use ICU4C to translate such names.
Code for language_code_to_name.cpp:
#include <iostream>
#include <string>
#include <unicode/locid.h>
#include <unicode/unistr.h>

// Return the display name for a language code (e.g. "afr" -> "Afrikaans"),
// localized for ICU's default locale.
std::string getLanguageFullName(const std::string& languageCode) {
    icu::Locale locale(languageCode.c_str());
    icu::UnicodeString displayName;
    locale.getDisplayName(displayName);  // fills displayName
    std::string s;
    displayName.toUTF8String(s);
    return s;
}

int main(int argc, char *argv[]) {
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <language-code>" << std::endl;
        return 1;
    }
    std::cout << getLanguageFullName(argv[1]) << std::endl;
    return 0;
}
Compile it with:
g++ -o language_code_to_name language_code_to_name.cpp -licui18n -licuuc -licudata
Then run it with all traineddata files:
for f in *.traineddata; do l=${f%.traineddata}; echo "$l - $(LANG=C.UTF-8 ./language_code_to_name "$l")"; done
afr - Afrikaans
amh - Amharic
ara - Arabic
asm - Assamese
aze_cyrl - Azerbaijani (Cyrillic)
aze - Azerbaijani
bel - Belarusian
[...]
tgk - Tajik
tha - Thai
tir - Tigrinya
ton - Tongan
tur - Turkish
uig - Uyghur
ukr - Ukrainian
urd - Urdu
uzb_cyrl - Uzbek (Cyrillic)
uzb - Uzbek
vie - Vietnamese
yid - Yiddish
yor - Yoruba
The same program can also show the full language names in French, German, Italian, Spanish or other languages.
Only for equ, frk and osd it won't show a full language name because those names are not ISO names.
Therefore I don't think that Tesseract should add that API call.
Thank you for the comprehensive answer and the demo program. I agree with you that Tesseract doesn't need that API call.
Regarding frk.traineddata, it looks like the ISO code should be deu_latf. Then the full language name German (Fraktur Latin) can be derived automatically.
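Until the file is renamed, a caller could handle the three non-ISO names on its own side before asking ICU. The following is only a sketch: both helper names are hypothetical, the frk alias is the one suggested above, and the label texts for equ and osd are assumptions chosen for illustration.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical caller-side helper, not part of Tesseract or ICU.
// Maps a traineddata name to the code that should be passed to ICU.
// Only the frk -> deu_latf alias comes from the discussion above.
std::string toIcuCode(const std::string& traineddataName) {
    static const std::unordered_map<std::string, std::string> aliases = {
        {"frk", "deu_latf"},
    };
    auto it = aliases.find(traineddataName);
    return it != aliases.end() ? it->second : traineddataName;
}

// Returns true if the name is one of the non-language models that ICU
// cannot translate, and stores an illustrative label in `label`.
// The label texts are assumptions made for this sketch.
bool specialModelLabel(const std::string& traineddataName, std::string& label) {
    static const std::unordered_map<std::string, std::string> labels = {
        {"equ", "Math / equation detection"},
        {"osd", "Orientation and script detection"},
    };
    auto it = labels.find(traineddataName);
    if (it == labels.end()) {
        return false;
    }
    label = it->second;
    return true;
}
```

A caller would first try specialModelLabel and otherwise pass toIcuCode(name) to getLanguageFullName.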
Your Feature Request
Functions like GetAvailableLanguagesAsVector return language codes. There's a page listing all languages with their full names, but these full names don't seem to be available through the API. Could you please add an API call that would return the full language name for the language code?
Thank you, Yuri