urieli / jochre

Java Optical CHaracter Recognition
GNU Affero General Public License v3.0

Latin text not rendered in OCRed text #32

Open mirjam-amsterdam opened 5 years ago

mirjam-amsterdam commented 5 years ago

https://ocr.yiddishbookcenter.org/contents?doc=nybc202767#page24

Latin text within the Yiddish text is not rendered, and there is also no placeholder indicating that some text is missing and that the reader should go to the original scan (before preparing an e-book, before quoting, etc.).

[Screenshots: latin text missing 2, latin text missing 1]

urieli commented 5 years ago

Yes, in the early analyses I made the mistake of configuring a "junk setting", which ignores text whose confidence score is too low. This means certain passages (typically those in other alphabets) are simply skipped. In the newer analyses this should no longer be the case. However, I'd rather wait for the new version of Jochre to fix this, as that version should be able to handle multiple alphabets.
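
To illustrate the difference between skipping low-confidence text and flagging it, here is a minimal sketch. It is not Jochre's actual code: the `Word` class, the `render` method, the `[?]` marker and the threshold value are all hypothetical names chosen for this example. The idea is simply that a word below the threshold is replaced by a visible marker instead of being dropped, so the reader knows to consult the original scan.

```java
import java.util.List;

// Hypothetical sketch, not part of Jochre: shows a placeholder being emitted
// for low-confidence words instead of silently skipping them.
final class PlaceholderSketch {

    /** A recognised word with a confidence score in [0, 1] (assumed model). */
    static final class Word {
        final String text;
        final double confidence;

        Word(String text, double confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    /** Marker shown where the recogniser was not confident (assumed format). */
    static final String PLACEHOLDER = "[?]";

    /**
     * Joins the words with single spaces. A word whose confidence is below
     * the threshold is replaced by PLACEHOLDER rather than removed.
     */
    static String render(List<Word> words, double threshold) {
        StringBuilder sb = new StringBuilder();
        for (Word w : words) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(w.confidence < threshold ? PLACEHOLDER : w.text);
        }
        return sb.toString();
    }
}
```

With a threshold of 0.5, a confident word followed by a word scored 0.1 would render as the first word followed by `[?]`, rather than as the first word alone.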

mirjam-amsterdam commented 2 years ago

Stumbled over a misreading: when searching for מאַנש, I get a result that is actually in Latin letters: Wien! Please do make Latin letters searchable and show them as Latin letters in the text. And don't give me false results when I am looking for Mansch...

[Screenshots: Wien, source of Mansch - Wien]
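
As a rough sketch of what "show them as Latin letters" could build on, the snippet below tags a token with its dominant Unicode script using the standard `Character.UnicodeScript` API. The class and method names are hypothetical and this is not how Jochre or the search site works. It also only helps once the recogniser actually emits Latin characters; it cannot by itself correct a Latin word that was misread as Hebrew letters.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch: classify a token by the script of most of its letters,
// so that Latin and Hebrew-script tokens can be displayed and indexed separately.
final class ScriptTagSketch {

    /** Returns the script used by most letters in the token, or UNKNOWN if it has no letters. */
    static UnicodeScript dominantScript(String token) {
        Map<UnicodeScript, Integer> counts = new EnumMap<>(UnicodeScript.class);
        // Count letters only; combining marks, digits and punctuation are ignored.
        token.codePoints()
             .filter(cp -> Character.isLetter(cp))
             .forEach(cp -> counts.merge(UnicodeScript.of(cp), 1, Integer::sum));

        UnicodeScript best = UnicodeScript.UNKNOWN;
        int bestCount = 0;
        for (Map.Entry<UnicodeScript, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** True if the query and the token are written mostly in the same script. */
    static boolean sameScript(String query, String token) {
        return dominantScript(query) == dominantScript(token);
    }
}
```

Under these assumptions, a Hebrew-script query such as מאַנש and a correctly recognised Latin token such as Wien fall into different scripts, so an index could keep them apart.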