meh / ruby-tesseract-ocr

A Ruby wrapper library to the tesseract-ocr API.

Tesseract::API.to_language_codes output is incorrect #41

Open knowtheory opened 10 years ago

knowtheory commented 10 years ago
tesseract --list-langs 2>&1 | ruby -r 'tesseract' -e 'puts "ok?, code, api"; STDIN.read.split("\n").map{ |code| res = Tesseract::API.to_language_code(code); puts "#{code == res}, #{code}, #{res}" }'
ok?, code, api
true, List of available languages (69):, List of available languages (69):
true, afr, afr
true, ara, ara
true, aze, aze
true, bel, bel
true, ben, ben
true, bul, bul
true, cat, cat
false, ces, cze
true, chi_sim, chi_sim
true, chi_tra, chi_tra
true, chr, chr
true, dan-frak, dan-frak
true, dan, dan
true, deu-frak, deu-frak
false, deu, ger
false, ell, gre
true, eng, eng
true, enm, enm
true, epo, epo
true, epo_alt, epo_alt
true, equ, equ
true, est, est
false, eus, baq
true, fin, fin
false, fra, fre
true, frk, frk
true, frm, frm
true, glg, glg
true, grc, grc
true, heb, heb
true, hin, hin
true, hrv, hrv
true, hun, hun
true, ind, ind
false, isl, ice
true, ita, ita
true, ita_old, ita_old
true, jpn, jpn
true, kan, kan
true, kor, kor
true, lav, lav
true, lit, lit
true, mal, mal
false, mkd, mac
true, mlt, mlt
false, msa, may
false, nld, dut
true, nor, nor
true, osd, osd
true, pol, pol
true, por, por
false, ron, rum
true, rus, rus
true, slk-frak, slk-frak
false, slk, slo
true, slv, slv
true, spa, spa
true, spa_old, spa_old
false, sqi, alb
true, srp, srp
true, swa, swa
true, swe, swe
true, tam, tam
true, tel, tel
true, tgl, tgl
true, tha, tha
true, tur, tur
true, ukr, ukr
true, vie, vie

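Unfortunately, I don't think the ISO_639 conversion is viable :\ In every row marked false above, the API returns a different three-letter code (e.g. cze, ger, fre) from the one tesseract itself lists (ces, deu, fra). I suspect that an internal hash keeping track of the codes is going to be necessary.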
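
The internal hash suggested above could look roughly like the sketch below. It is only an illustration: the constant and method names are hypothetical and not part of the gem, and the table contains just the twelve mismatches visible in the listing, so a different tesseract install may need more entries.

```ruby
# Hypothetical lookup table, built only from the rows marked "false" above:
# code returned by the API => code listed by `tesseract --list-langs`.
API_TO_TESSERACT_CODE = {
  'cze' => 'ces',
  'ger' => 'deu',
  'gre' => 'ell',
  'baq' => 'eus',
  'fre' => 'fra',
  'ice' => 'isl',
  'mac' => 'mkd',
  'may' => 'msa',
  'dut' => 'nld',
  'rum' => 'ron',
  'slo' => 'slk',
  'alb' => 'sqi'
}.freeze

# Hypothetical helper: translate a code when the table knows it,
# otherwise pass it through unchanged (eng, chi_sim, dan-frak, ...).
def to_tesseract_code(code)
  API_TO_TESSERACT_CODE.fetch(code, code)
end

puts to_tesseract_code('ger')     # deu
puts to_tesseract_code('chi_sim') # chi_sim
```

Passing unknown codes through unchanged keeps tesseract-specific names such as chi_sim or dan-frak working without listing them in the table.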

shishirsharma commented 10 years ago

What is the workaround for this?

shishirsharma commented 10 years ago

I think you have to use alpha3_terminologic in the cases where it is available:

require 'iso-639'
lang = 'cze'; entry = ISO_639.find(lang)  # look the code up once instead of three times
entry.alpha3_terminologic.empty? ? entry.alpha3 : entry.alpha3_terminologic
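
A slightly more defensive version of this workaround might look like the sketch below. The method name is hypothetical, and the Struct is only a stand-in for the record that ISO_639.find returns, so the example runs without the gem. It assumes the lookup yields nil for names that are not ISO codes (chi_sim, dan-frak and similar), which should be checked against the gem before relying on it.

```ruby
# Stand-in for the record returned by ISO_639.find in the snippet above;
# only the two readers used there are modelled.
LanguageEntry = Struct.new(:alpha3, :alpha3_terminologic)

# Hypothetical helper: prefer the terminologic code when one exists,
# fall back to alpha3, and return the input untouched when the lookup
# found nothing (assumed here to be signalled by nil).
def tesseract_code(code, entry)
  return code if entry.nil?

  terminologic = entry.alpha3_terminologic.to_s
  terminologic.empty? ? entry.alpha3 : terminologic
end

puts tesseract_code('cze', LanguageEntry.new('cze', 'ces')) # ces
puts tesseract_code('eng', LanguageEntry.new('eng', ''))    # eng
puts tesseract_code('chi_sim', nil)                         # chi_sim
```

With the real gem the call would presumably be tesseract_code(lang, ISO_639.find(lang)).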

shishirsharma commented 10 years ago

Do you have any update on this?