unescaped ampersand in hOCR output

oliveiracwb / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

unescaped ampersand in hOCR output #1398

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

The problem has been reported by Jakub Wilk on Debian BTS

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=774654

but I'm the user who discovered it and I would like to draw your attention to 
the problem.

Tesseract sometimes produces hOCR with an unescaped ampersand, which makes the 
whole XHTML file ill-formed. A minimal example is included in Jakub Wilk's 
report.
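
For anyone who wants to check their own hOCR output for this, here is a minimal sketch of a scanner for bare ampersands. It is not part of Tesseract; the function name HasBareAmpersand is hypothetical, and the check only tests whether each '&' begins something shaped like a reference (&name; or &#123; or &#x1F;). It does not verify that a named entity is actually declared.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Tesseract API): returns true if `markup`
// contains an '&' that does not start something shaped like an entity or
// character reference, i.e. &name; or &#digits; or &#xhex;.
bool HasBareAmpersand(const std::string& markup) {
  for (std::size_t i = 0; i < markup.size(); ++i) {
    if (markup[i] != '&') continue;
    std::size_t j = i + 1;
    bool numeric = false;
    bool hex = false;
    if (j < markup.size() && markup[j] == '#') {
      numeric = true;
      ++j;
      if (j < markup.size() && markup[j] == 'x') {
        hex = true;
        ++j;
      }
    }
    const std::size_t start = j;
    // Consume the reference body: digits, hex digits, or an alphanumeric name.
    while (j < markup.size()) {
      const unsigned char c = static_cast<unsigned char>(markup[j]);
      const bool ok = numeric ? (hex ? std::isxdigit(c) != 0 : std::isdigit(c) != 0)
                              : std::isalnum(c) != 0;
      if (!ok) break;
      ++j;
    }
    // An empty body or a missing ';' terminator means the '&' is bare.
    if (j == start || j >= markup.size() || markup[j] != ';') return true;
  }
  return false;
}
```

With this sketch, a fragment such as `<span>&c</span>` is flagged, while `<span>&amp;c</span>` is not.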

Regards

Janusz

Original issue reported on code.google.com by jsb...@mimuw.edu.pl on 6 Jan 2015 at 7:00

GoogleCodeExporter commented 9 years ago

Fixed in 09b0c91fc9bd.

HOcrEscape was applied only when the symbol string held a single character. In 
this case, however, GetUTF8Text(RIL_SYMBOL) returns two characters ("&c") in 
one string, so the string was not escaped. I removed this limit and ran just a 
few tests. They show no problems so far...
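
To illustrate the behaviour described above, here is a rough sketch. It is not the code from the commit: EscapeForHocr and EscapeSingleSymbolOnly are hypothetical names, and the set of escaped characters is an assumption about what an hOCR escaper would typically cover.

```cpp
#include <string>

// Hypothetical stand-in for an hOCR escaper (not Tesseract's actual
// HOcrEscape): replaces characters that are special in XML with references.
std::string EscapeForHocr(const std::string& text) {
  std::string out;
  out.reserve(text.size());
  for (char ch : text) {
    switch (ch) {
      case '&':  out += "&amp;";  break;
      case '<':  out += "&lt;";   break;
      case '>':  out += "&gt;";   break;
      case '"':  out += "&quot;"; break;
      case '\'': out += "&#39;";  break;
      default:   out += ch;       break;
    }
  }
  return out;
}

// Assumed shape of the pre-fix logic as described in the comment: escaping
// is applied only when the symbol string is exactly one character long, so
// a two-character result such as "&c" passes through unchanged.
std::string EscapeSingleSymbolOnly(const std::string& symbol) {
  return symbol.size() == 1 ? EscapeForHocr(symbol) : symbol;
}
```

Under these assumptions, EscapeSingleSymbolOnly("&c") returns the string unchanged, while EscapeForHocr("&c") yields "&amp;c", which corresponds to removing the length limit.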

Original comment by zde...@gmail.com on 6 Feb 2015 at 10:53

Changed state: Fixed

Attachments:

test.png

GoogleCodeExporter commented 9 years ago

I hope the fix is general enough, but I don't understand the origin of the 
problem. 
Why does GetUTF8Text(RIL_SYMBOL) return 2 symbols? And why exactly "&c"? I see 
no relation to the code point of the Tironian et. Can this also happen for 
other code points?

Original comment by bjanusz...@gmail.com on 11 Feb 2015 at 7:26