Please add the API call to translate the language code to the full language name

yurivict commented 6 months ago

Your Feature Request

Functions like GetAvailableLanguagesAsVector return language codes. There's a page listing all languages with their full names, but these full names don't seem to be available through the API.

Could you please add an API call that would return the full language name for the language code?

Thank you, Yuri

stweil commented 6 months ago

Most language codes are ISO 639-2 codes. Use ICU4C to translate such codes into full names.

Code for language_code_to_name.cpp:

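#include <iostream>
#include <string>
#include <unicode/locid.h>
#include <unicode/unistr.h>

// Return the full name for a language code (e.g. "afr" -> "Afrikaans"),
// written in the language of the current default locale.
std::string getLanguageFullName(const std::string& languageCode) {
    icu::Locale locale(languageCode.c_str());
    icu::UnicodeString displayName;
    locale.getDisplayName(displayName);  // fills displayName
    std::string s;
    displayName.toUTF8String(s);
    return s;
}

int main(int argc, char *argv[]) {
    // Require exactly one language code on the command line.
    if (argc != 2) {
        std::cerr << "Usage: " << argv[0] << " LANGUAGE_CODE" << std::endl;
        return 1;
    }
    std::cout << getLanguageFullName(argv[1]) << std::endl;
    return 0;
}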

Compile it with:

g++ -o language_code_to_name language_code_to_name.cpp -licui18n -licuuc -licudata

Then run it with all traineddata files:

for f in *.traineddata; do l=${f%.traineddata}; echo "$l - $(LANG=C.UTF-8 ./language_code_to_name "$l")"; done
afr - Afrikaans
amh - Amharic
ara - Arabic
asm - Assamese
aze_cyrl - Azerbaijani (Cyrillic)
aze - Azerbaijani
bel - Belarusian
[...]
tgk - Tajik
tha - Thai
tir - Tigrinya
ton - Tongan
tur - Turkish
uig - Uyghur
ukr - Ukrainian
urd - Urdu
uzb_cyrl - Uzbek (Cyrillic)
uzb - Uzbek
vie - Vietnamese
yid - Yiddish
yor - Yoruba

The same program can also show the full language names in French, German, Italian, Spanish or other languages. Only equ, frk and osd get no full language name, because those codes are not ISO codes.

Therefore I don't think that Tesseract should add that API call.

yurivict commented 6 months ago

Thank you for the comprehensive answer and the demo program. I agree with you that Tesseract doesn't need that API call.

stweil commented 6 months ago

Regarding frk.traineddata, it looks like the ISO code should be deu_latf. Then the full language name German (Fraktur Latin) can be derived automatically.
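Until such a rename happens, a caller could handle the three non-ISO names before doing the ICU lookup. The following is only a minimal sketch: the helper name toLookupCode is hypothetical and is not part of Tesseract or ICU. It assumes that equ and osd are helper modules rather than languages, and that frk should be looked up as deu_latf, as suggested in this thread.

```cpp
#include <string>

// Hypothetical helper (not part of Tesseract or ICU): map a traineddata
// name to the code that should be passed to an ISO-based lookup.
// Returns an empty string when the name is assumed not to be a language.
std::string toLookupCode(const std::string& tessCode) {
    // Assumption: equ and osd are helper modules, not languages.
    if (tessCode == "equ" || tessCode == "osd") {
        return std::string();
    }
    // Alias suggested in this thread: look up frk as deu_latf.
    if (tessCode == "frk") {
        return "deu_latf";
    }
    // All other names are passed through unchanged.
    return tessCode;
}
```

The result could then be passed to getLanguageFullName from the demo program whenever it is not empty, with the caller choosing its own label for the empty case.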

tesseract-ocr / tesseract

Please add the API call to translate the language code to the full language name #4201
