jwilk-archive / ocrodjvu

OCR for DjVu
GNU General Public License v2.0

Sometimes ampersand is not escaped in the hOCR output #15

Closed: jwilk closed this issue 8 years ago

jwilk commented 9 years ago

Issue reported by @jsbien:

It happens when a character that looks like r rotunda (?) is "recognized" as &. I guess it is a Tesseract bug and should be reported to its developers, but I am reporting it here first, just in case.
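To illustrate why an unescaped ampersand matters, here is a minimal Python sketch. It is not ocrodjvu's or Tesseract's actual code: `wrap_word` and `is_well_formed` are hypothetical helpers, and the `ocrx_word` span only stands in for real hOCR markup. It shows that a bare & makes the markup ill-formed for an XML parser, while escaping it first does not.

```python
import html
import xml.etree.ElementTree as ET


def wrap_word(text, escape=True):
    # Hypothetical helper: wrap recognized text in an hOCR-like span.
    if escape:
        # Turns & into &amp; (and < > into &lt; &gt;).
        text = html.escape(text, quote=False)
    return "<span class='ocrx_word'>{}</span>".format(text)


def is_well_formed(markup):
    # True if an XML parser accepts the markup.
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


raw = wrap_word("&c", escape=False)   # bare ampersand, as in the reported output
fixed = wrap_word("&c", escape=True)  # ampersand escaped as &amp;

print(raw, is_well_formed(raw))
print(fixed, is_well_formed(fixed))
```

The unescaped variant is rejected by the parser because `&c<` is not a valid entity reference; the escaped one parses and yields the original text back.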

An example is available at http://teksty.klf.uw.edu.pl/23/.


LindeIIGP4ocr170.json LindeIIGP4ocr409.json LindeIIGP4ocr530.json LindeIIGP4ocr577.json LindeIIGP5ocr529.json LindeIIGP5ocr543.json LindeIIGP5ocr551.json LindeIIGP5ocr585.json

jwilk commented 9 years ago

I believe the character in question is Tironian et.

I've filed a bug against Tesseract: https://bugs.debian.org/774654

Is there anything that you want me to do on the ocrodjvu side, or shall I close this bug?

jwilk commented 9 years ago

Comment submitted by @jsbien:

Thank you very much for identifying the character!

Thank you for creating a minimal example (I tried to do it myself, but was confused by "Skipping this page") and submitting the bug report. I've created an issue upstream linked to your report: https://code.google.com/p/tesseract-ocr/issues/detail?id=1398.

We will see what happens :-)

jwilk commented 9 years ago

Comment submitted by @jsbien:

Fixed upstream in https://github.com/tesseract-ocr/tesseract/commit/09b0c91fc9bdb7a665416a0056e40823fad8e235.

At least I hope so. Zdenko Podobný's comment is unclear to me: why does GetUTF8Text(RIL_SYMBOL) return 2 symbols? Why just "&c"? I see no relation to the code point of Tironian et.
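For reference, the two characters are indeed unrelated at the code-point level. A quick Python check (illustrative only; `describe` is a hypothetical helper, and nothing here is taken from Tesseract) shows the code points, Unicode names, and UTF-8 encodings of Tironian et and the ampersand:

```python
import unicodedata


def describe(ch):
    # Hypothetical helper: code point, Unicode name, and UTF-8 bytes of one character.
    return "U+{:04X} {} {}".format(
        ord(ch), unicodedata.name(ch), ch.encode("utf-8").hex()
    )


tironian_et = "\u204a"  # the character jwilk identified in the scans
ampersand = "&"         # what the OCR output contained instead

print(describe(tironian_et))
print(describe(ampersand))
```

This says nothing about why the recognizer produced "&c"; it only confirms that the output cannot be explained by the encoding of Tironian et itself.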

jwilk commented 8 years ago

Comment submitted by @jsbien:

I just verified that the bug is also fixed in the current Tesseract version (3.04.01). Thanks again!