n8marti closed this issue 1 year ago
I don't know C++, but it seems like something might be going wrong with this function, Utils::string_html_escape: https://github.com/manisandro/gImageReader/blob/master/gtk/src/Utils.cc#L382
I've seen instances where "normal" characters get improperly replaced, e.g. "n" gets replaced with "&" here, as if it were actually "&" in the text:
<span title="bbox 512 1454 846 1499; x_fsize 11; x_wconf 95" class="ocrx_word" id="word_7_208" lang="fr_BE">“Développeme&t</span>
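For comparison, here is a minimal sketch of what an HTML escape step is expected to do. It is written in Python rather than the project's C++, and it is not taken from gImageReader's code; the name `escape_html` is hypothetical. Only `&`, `<`, `>` and `"` are rewritten, and every other character, accented or not, should pass through unchanged.

```python
# Hypothetical reference escaper, NOT gImageReader's Utils::string_html_escape.
_ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;"}


def escape_html(text: str) -> str:
    # Python's str is a sequence of code points, so a multi-byte
    # character such as "é" is handled as one unit and never split.
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


word = "“Développement"
escaped = escape_html(word)  # contains no special characters, so it is unchanged
```

With an escaper like this, the word from the span above would come out exactly as it went in, which is why a stray `&` in the middle of it points at the escape step.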
Thanks for the hint, should be fixed by https://github.com/manisandro/gImageReader/commit/d3a809cf4c10cb0b0392ed8e7b05e1a5824222a7
Thanks. Will this show up as a new package in the Ubuntu PPA?
I've submitted the PPA builds for jammy and kinetic.
Great. Any chance you could have it build for focal, too?
Done
This is related to #605, but that issue discusses opening malformed HTML files. It seems gImageReader is sometimes improperly encoding the text when exporting to HTML.
I have an hOCR file generated by gImageReader that includes this content:
Notice the 2nd line's text, "Sociét\C3<". In the OCR output, the character exported as "\C3<" is actually the same é that appears earlier in the word. I have no idea why this would happen, and it doesn't always happen: that same text was fine in a previous save of the file, though there were similar errors elsewhere.
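One observation that may help narrow this down (an inference from the bytes, not something confirmed in this thread): in UTF-8, é is the two-byte sequence C3 A9, so "\C3<" looks like the first byte survived while the second was swapped for "<". The short Python check below simulates that damage under this assumption and shows that the result is no longer valid UTF-8. The helper `is_valid_utf8` is a name made up for this sketch.

```python
good = "Société".encode("utf-8")

# In UTF-8, "é" is encoded as the two bytes 0xC3 0xA9.
e_acute = "é".encode("utf-8")

# Simulate the damage seen in the file: the final 0xA9 byte replaced by "<".
# This is an assumption about what happened, not a trace of the real code.
damaged = good[:-1] + b"<"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Here `is_valid_utf8(good)` is true and `is_valid_utf8(damaged)` is false, since 0xC3 must be followed by a continuation byte and "<" is not one.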
In fact, after multiple cycles of open, save, and reopen, there are quite a few inconsistencies between the resulting files, even though no edits were made in gImageReader. I've zipped 3 HTML files: Kako-Francais_24421.clean.html, Kako-Francais_24421.dirty.01-26-0926.html, and Kako-Francais_24421.dirty.01-26-0928.html. The file with "clean" in the name opened correctly in gImageReader. I then immediately saved the content as a new HTML file, "...dirty.01-26-0926.html", which failed to reopen in the app. So I reopened the "clean" file, saved again, and the 3rd file also failed to open. All 3 files differ in how the text of seemingly random words is exported to the tag. In other words, the export process is inconsistent given the same starting data. Many of the errors I've seen happen when a character carries a grave or acute accent, but it has also happened with basic Latin characters, such as one instance of "n" being saved in the HTML as ">".
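For anyone who wants to locate the differing words between two saves, here is a rough Python sketch. It assumes words are stored as `<span ... class="ocrx_word" id="..." ...>text</span>` with no nested tags, as in the span quoted earlier in this thread. It uses regular expressions instead of an HTML parser because the damaged files may not parse. The function names `word_texts` and `diff_words` are made up for this example.

```python
import re

# Match only word-level spans; the enclosing line/paragraph spans are skipped.
WORD_RE = re.compile(r'<span\b([^>]*\bclass="ocrx_word"[^>]*)>(.*?)</span>', re.S)
ID_RE = re.compile(r'\bid="([^"]*)"')


def word_texts(html: str) -> dict[str, str]:
    """Map each ocrx_word id to the raw text inside its span."""
    words = {}
    for attrs, text in WORD_RE.findall(html):
        match = ID_RE.search(attrs)
        if match:
            words[match.group(1)] = text
    return words


def diff_words(a: str, b: str) -> list[tuple[str, str, str]]:
    """List (id, text_in_a, text_in_b) for words whose text differs."""
    wa, wb = word_texts(a), word_texts(b)
    return [(wid, wa[wid], wb.get(wid, "")) for wid in wa if wa[wid] != wb.get(wid)]
```

To use it on the zipped files, read each one with `open(path, encoding="utf-8", errors="replace").read()` so that invalid bytes do not abort the read, then pass the "clean" text and one "dirty" text to `diff_words`.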
Kako-Francais_24421_html.zip