jwilk-archive / ocrodjvu

OCR for DjVu
GNU General Public License v2.0

adopt IETF language tags (BCP 47) #33

Open jwilk opened 5 years ago

jwilk commented 5 years ago

We should use IETF language tags (BCP 47) instead of ISO 639-2 codes or the non-standard names that Tesseract uses.

jwilk commented 2 years ago

Here's a (partial) mapping between Tesseract script names and ISO 15924 script codes:

| Tesseract | ISO 15924 |
|---|---|
| Arabic | Arab |
| Armenian | Armn |
| Bengali | Beng |
| Canadian_Aboriginal | Cans |
| Cherokee | Cher |
| Cyrillic | Cyrl |
| Devanagari | Deva |
| Ethiopic | Ethi |
| Fraktur | Latf |
| Georgian | Geor |
| Greek | Grek |
| Gujarati | Gujr |
| Gurmukhi | Guru |
| HanS | Hans |
| HanT | Hant |
| Hangul | Hang |
| Hebrew | Hebr |
| Japanese | Jpan |
| Kannada | Knda |
| Khmer | Khmr |
| Lao | Laoo |
| Latin | Latn |
| Malayalam | Mlym |
| Myanmar | Mymr |
| Oriya | Orya |
| Sinhala | Sinh |
| Syriac | Syrc |
| Tamil | Taml |
| Telugu | Telu |
| Thaana | Thaa |
| Thai | Thai |
| Tibetan | Tibt |
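For illustration, the script table above could be carried into code as a plain dictionary. This is only a minimal sketch: the names `TESSERACT_SCRIPT_TO_ISO15924`, `script_to_iso15924` and `script_to_bcp47` are hypothetical and not part of ocrodjvu. Using the BCP 47 `und` (undetermined language) subtag for script-only models is an assumption about one possible design, not something decided in this issue.

```python
# Tesseract script name -> ISO 15924 script code, copied from the table above.
TESSERACT_SCRIPT_TO_ISO15924 = {
    'Arabic': 'Arab', 'Armenian': 'Armn', 'Bengali': 'Beng',
    'Canadian_Aboriginal': 'Cans', 'Cherokee': 'Cher', 'Cyrillic': 'Cyrl',
    'Devanagari': 'Deva', 'Ethiopic': 'Ethi', 'Fraktur': 'Latf',
    'Georgian': 'Geor', 'Greek': 'Grek', 'Gujarati': 'Gujr',
    'Gurmukhi': 'Guru', 'HanS': 'Hans', 'HanT': 'Hant', 'Hangul': 'Hang',
    'Hebrew': 'Hebr', 'Japanese': 'Jpan', 'Kannada': 'Knda', 'Khmer': 'Khmr',
    'Lao': 'Laoo', 'Latin': 'Latn', 'Malayalam': 'Mlym', 'Myanmar': 'Mymr',
    'Oriya': 'Orya', 'Sinhala': 'Sinh', 'Syriac': 'Syrc', 'Tamil': 'Taml',
    'Telugu': 'Telu', 'Thaana': 'Thaa', 'Thai': 'Thai', 'Tibetan': 'Tibt',
}


def script_to_iso15924(name):
    """Return the ISO 15924 code for a Tesseract script name, or None if unmapped."""
    return TESSERACT_SCRIPT_TO_ISO15924.get(name)


def script_to_bcp47(name):
    """Build a script-only BCP 47 tag such as 'und-Latn'.

    Assumption: 'und' is used as the primary subtag because a script model
    does not identify a single language. Returns None for unmapped names.
    """
    code = script_to_iso15924(name)
    if code is None:
        return None
    return 'und-' + code
```

Script names that the table does not cover simply yield `None`, so a caller can decide how to report them.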

The table doesn't cover:

jwilk commented 2 years ago

Most Tesseract language codes are either ISO 639-2 or ISO 639-3 codes, possibly with some non-standard suffixes.

Here's a mapping between Tesseract suffixes and ISO 15924 script codes:

| Tesseract | ISO 15924 |
|---|---|
| ara | Arab |
| cyrl | Cyrl |
| frak | Latf |
| latn | Latn |
| sim | Hans |
| tra | Hant |
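A rough sketch of how the suffix table above might be used to turn a Tesseract language code into a BCP 47 tag. Everything here is an assumption for illustration: the function name `tesseract_to_bcp47` is hypothetical, the input codes are assumed to have the form `lang` or `lang_suffix`, and `ISO639_TO_BCP47` is a tiny hand-picked subset (BCP 47 prefers the two-letter ISO 639-1 subtag when one exists). A real implementation would need a complete language table.

```python
# Tesseract suffix -> ISO 15924 script code, copied from the table above.
SUFFIX_TO_SCRIPT = {
    'ara': 'Arab',
    'cyrl': 'Cyrl',
    'frak': 'Latf',
    'latn': 'Latn',
    'sim': 'Hans',
    'tra': 'Hant',
}

# Assumption: illustrative subset of three-letter codes and their
# two-letter BCP 47 primary language subtags. Not a complete table.
ISO639_TO_BCP47 = {
    'aze': 'az',
    'chi': 'zh',
    'deu': 'de',
    'kur': 'ku',
    'srp': 'sr',
    'uzb': 'uz',
}


def tesseract_to_bcp47(code):
    """Convert a code like 'chi_sim' to a BCP 47 tag like 'zh-Hans'.

    Raises ValueError for suffixes that the table above does not cover.
    """
    lang, sep, suffix = code.partition('_')
    # Fall back to the code unchanged; this is only correct for languages
    # that have no two-letter code.
    primary = ISO639_TO_BCP47.get(lang, lang)
    if not sep:
        return primary
    script = SUFFIX_TO_SCRIPT.get(suffix)
    if script is None:
        raise ValueError('unmapped Tesseract suffix: ' + repr(suffix))
    return primary + '-' + script
```

For example, `tesseract_to_bcp47('deu_frak')` gives `'de-Latf'`. Suffixes outside the table raise `ValueError` rather than being guessed.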

It's not clear how to map these: