Open jwilk opened 5 years ago
Here's a (partial) mapping between Tesseract script names and ISO 15924 script codes:
Tesseract | ISO 15924 |
---|---|
Arabic | Arab |
Armenian | Armn |
Bengali | Beng |
Canadian_Aboriginal | Cans |
Cherokee | Cher |
Cyrillic | Cyrl |
Devanagari | Deva |
Ethiopic | Ethi |
Fraktur | Latf |
Georgian | Geor |
Greek | Grek |
Gujarati | Gujr |
Gurmukhi | Guru |
HanS | Hans |
HanT | Hant |
Hangul | Hang |
Hebrew | Hebr |
Japanese | Jpan |
Kannada | Knda |
Khmer | Khmr |
Lao | Laoo |
Latin | Latn |
Malayalam | Mlym |
Myanmar | Mymr |
Oriya | Orya |
Sinhala | Sinh |
Syriac | Syrc |
Tamil | Taml |
Telugu | Telu |
Thaana | Thaa |
Thai | Thai |
Tibetan | Tibt |

The table doesn't cover:
- `HanS_vert`, `HanT_vert`, `Hangul_vert`, `Japanese_vert`
- `Vietnamese`

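As a rough sketch, the script-name table could be carried around as a plain lookup. The names `TESSERACT_SCRIPT_TO_ISO15924` and `iso15924_for` are made up for illustration; the entries are copied from the table, and names the table doesn't cover simply yield `None`.

```python
# Tesseract script name -> ISO 15924 code, copied from the (partial) table.
TESSERACT_SCRIPT_TO_ISO15924 = {
    "Arabic": "Arab", "Armenian": "Armn", "Bengali": "Beng",
    "Canadian_Aboriginal": "Cans", "Cherokee": "Cher", "Cyrillic": "Cyrl",
    "Devanagari": "Deva", "Ethiopic": "Ethi", "Fraktur": "Latf",
    "Georgian": "Geor", "Greek": "Grek", "Gujarati": "Gujr",
    "Gurmukhi": "Guru", "HanS": "Hans", "HanT": "Hant", "Hangul": "Hang",
    "Hebrew": "Hebr", "Japanese": "Jpan", "Kannada": "Knda", "Khmer": "Khmr",
    "Lao": "Laoo", "Latin": "Latn", "Malayalam": "Mlym", "Myanmar": "Mymr",
    "Oriya": "Orya", "Sinhala": "Sinh", "Syriac": "Syrc", "Tamil": "Taml",
    "Telugu": "Telu", "Thaana": "Thaa", "Thai": "Thai", "Tibetan": "Tibt",
}


def iso15924_for(script_name):
    """Return the ISO 15924 code for a Tesseract script name.

    Returns None for names the table does not cover
    (for example the *_vert variants and Vietnamese).
    """
    return TESSERACT_SCRIPT_TO_ISO15924.get(script_name)
```

For example, `iso15924_for("Fraktur")` gives `"Latf"`, while `iso15924_for("HanS_vert")` gives `None` rather than guessing.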
Most Tesseract language codes are either ISO 639-2 or ISO 639-3 codes, possibly with some non-standard suffixes.
Here's a mapping between Tesseract suffixes and ISO 15924 script codes:
Tesseract | ISO 15924 |
---|---|
ara | Arab |
cyrl | Cyrl |
frak | Latf |
latn | Latn |
sim | Hans |
tra | Hant |

It's not clear how to map these:
- `chi_sim_vert`, `chi_tra_vert`, `jpn_vert`, `kor_vert`
- `spa_old`, `kat_old`, `ita_old`
- `equ` ("Math / equation detection module")
- `osd` ("Orientation and script detection module")

We should use IETF language tags (BCP 47) instead of ISO 639-2 codes or the non-standard names Tesseract uses.
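A minimal sketch of what such a conversion might look like, assuming the suffix table from this issue plus a small hand-written table of language subtags. `SUFFIX_TO_SCRIPT`, `LANGUAGE_SUBTAG`, `NOT_A_LANGUAGE` and `to_bcp47` are hypothetical names, and `LANGUAGE_SUBTAG` is deliberately incomplete; a real implementation would need the full ISO 639 to BCP 47 mapping. Codes that the issue lists as unclear are returned as `None`.

```python
# Tesseract suffix -> ISO 15924 script code, copied from the issue.
SUFFIX_TO_SCRIPT = {
    "ara": "Arab",
    "cyrl": "Cyrl",
    "frak": "Latf",
    "latn": "Latn",
    "sim": "Hans",
    "tra": "Hant",
}

# Assumption: a tiny illustrative subset. BCP 47 uses the two-letter
# ISO 639-1 code where one exists, so each entry must be looked up.
LANGUAGE_SUBTAG = {
    "aze": "az",
    "chi": "zh",
    "deu": "de",
    "eng": "en",
    "srp": "sr",
}

# Tesseract data files that are modules, not languages.
NOT_A_LANGUAGE = {"equ", "osd"}


def to_bcp47(tesseract_code):
    """Return a BCP 47 tag for a Tesseract language code, or None if unmapped."""
    if tesseract_code in NOT_A_LANGUAGE:
        return None
    # Split only at the first underscore, so "chi_sim_vert" keeps the
    # compound suffix "sim_vert" and is rejected below.
    base, sep, suffix = tesseract_code.partition("_")
    language = LANGUAGE_SUBTAG.get(base)
    if language is None:
        return None  # language not in the illustrative subset
    if not sep:
        return language
    script = SUFFIX_TO_SCRIPT.get(suffix)
    if script is None:
        return None  # e.g. "_vert" and "_old", which remain open questions
    return f"{language}-{script}"
```

With these tables, `to_bcp47("chi_sim")` yields `"zh-Hans"` and `to_bcp47("srp_latn")` yields `"sr-Latn"`. Returning `None` for anything unmapped keeps the open cases visible instead of silently producing an invalid tag.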