urieli / jochre

Java Optical CHaracter Recognition
GNU Affero General Public License v3.0

Find ways to handle bi-alphabetical Yiddish books #9

Open mirjam-amsterdam opened 6 years ago

mirjam-amsterdam commented 6 years ago

https://archive.org/details/nybc208282 is published in Yiddish in Hebrew letters >and<, in parallel, in Latin letters. The search results do not acknowledge that. Not only is it impossible to search for text in Latin characters, but the OCR text view also does not display the pages in Romanized Yiddish. (I was curious because Yiddish in Latin letters is my research field.) It would be nice if these texts were acknowledged and searchable. Displaying them in the flat OCR text would be nice too, since one can learn things from such a transcription, such as hints about pronunciation. (https://ocr.yiddishbookcenter.org/contents?doc=nybc208282#page4)

urieli commented 6 years ago

Handling different alphabets when indexing is planned for Jochre 3 (a complete rewrite of the OCR engine).

Isolated words in other alphabets (Latin or Cyrillic) exist in many of the books, especially non-fiction (history, etc.). There is also Harkavy's dictionary.

However, I hadn't initially considered the possibility of a single language being written in two different alphabets (in this case Yiddish, written in both the Hebrew and Latin alphabets). This means a 2-letter ISO language code isn't enough to identify the alphabet and the associated lexicon. It turns out ISO has also defined "script codes" (https://www.iso.org/schema/isosts/v1.0/doc/n-cvd0.html). I'll probably use these to indicate Yiddish in Latin transcription.
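
To make the language-plus-script idea concrete, here is a minimal sketch using the standard `java.util.Locale` API, which can carry an ISO 15924 script code alongside the language code. This is not Jochre code: the class `LanguageScriptKey` and the methods `of` and `lexiconKey` are hypothetical names chosen for illustration, and using the BCP 47 tag as a lexicon key is only an assumption about how the two could be tied together.

```java
import java.util.Locale;

// Hypothetical sketch, not part of Jochre: distinguishes the same language
// written in two scripts by combining a language code with a script code.
final class LanguageScriptKey {

    // Builds a Locale from an ISO 639 language code and an ISO 15924 script code.
    static Locale of(String language, String script) {
        return new Locale.Builder()
                .setLanguage(language)
                .setScript(script)
                .build();
    }

    // Assumed convention: the BCP 47 tag (e.g. "yi-Hebr", "yi-Latn") serves as
    // the key under which a script-specific lexicon would be registered.
    static String lexiconKey(Locale locale) {
        return locale.toLanguageTag();
    }

    public static void main(String[] args) {
        Locale yiddishHebrew = of("yi", "Hebr");
        Locale yiddishLatin = of("yi", "Latn");

        // The language code alone would be identical for both locales;
        // the script code is what tells them apart.
        System.out.println(lexiconKey(yiddishHebrew));
        System.out.println(lexiconKey(yiddishLatin));

        // A tag read from configuration can be parsed back into its parts.
        System.out.println(Locale.forLanguageTag("yi-Latn").getScript());
    }
}
```

The sketch relies on `toLanguageTag()` and `getScript()` rather than `getLanguage()`, because older Java versions report Yiddish through the legacy code `ji` in `getLanguage()`, while the BCP 47 tag consistently uses `yi`.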

mirjam-amsterdam commented 6 years ago

Thanks for considering! Best, Mirjam

(Sent by email in reply to Assaf Urieli's comment of Thu, Nov 15, 2018, 11:58 AM.)