Generated HTML file sometimes contains invalid (non-UTF-8) characters

n8marti commented 1 year ago

This is related to #605, but that issue discusses opening malformed HTML files. It seems gImageReader sometimes encodes the text improperly when exporting to HTML.

I have an hOCR file generated by gImageReader that includes this content:

    <span title="baseline 0.002 -14; bbox 461 1973 1279 2023; x_ascenders 12; x_descenders 12; x_size 48" class="ocr_line" id="line_3_15">
     <span title="bbox 461 1973 614 2010; x_fsize 12; x_wconf 96" class="ocrx_word" id="word_3_41" lang="fr_BE">Sociét\C3&lt;</span>
     <span title="bbox 634 1975 931 2010; x_fsize 12; x_wconf 96" class="ocrx_word" id="word_3_42" lang="fr_BE">Internationale</span>
     <span title="bbox 951 1976 1000 2011; x_fsize 12; x_wconf 96" class="ocrx_word" id="word_3_43" lang="fr_BE">de</span>
     <span title="bbox 1020 1975 1279 2023; x_fsize 12; x_wconf 96" class="ocrx_word" id="word_3_44" lang="fr_BE">Linguistique</span>
    </span>

Notice the text of the 2nd line, "Sociét\C3<". In the OCR output, the character exported as "\C3<" is actually é, the same character that appears correctly earlier in the word. I have no idea why this happens, and it doesn't always happen: that same text was fine in a previous save of the file, though there were similar errors elsewhere.

In fact, after multiple cycles of open, save, and reopen, there are quite a few inconsistencies between the resulting files, even though no edits were made in gImageReader. I've zipped 3 HTML files: Kako-Francais_24421.clean.html, Kako-Francais_24421.dirty.01-26-0926.html, and Kako-Francais_24421.dirty.01-26-0928.html. The file with "clean" in the name opened correctly in gImageReader. I immediately saved its content as a new HTML file, "...dirty.01-26-0926.html", which failed to reopen in the app. So I reopened the "clean" file, saved again, and that 3rd file also failed to open. All 3 files differ in how the text of seemingly random words is written into the span tags. In other words, the export process is inconsistent given the same starting data. Many of the errors I've seen involve a character with a grave or acute accent, but it has also happened with basic Latin characters, such as one instance of "n" being saved in the HTML as ">".

Kako-Francais_24421_html.zip
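
For anyone who wants to check their own exports, here is a minimal Python sketch that reports the byte offsets at which a file fails to decode as UTF-8. It is not part of gImageReader, and the name `find_invalid_utf8` is hypothetical. It also assumes that "\C3" above is how a viewer renders a raw 0xC3 byte; é is encoded in UTF-8 as the two bytes C3 A9, so a lone C3 followed by "<" would be an invalid sequence.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, bad_bytes) pairs for each undecodable sequence in data."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Decoding succeeds only if the rest of the data is valid UTF-8.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice, so shift by pos.
            start = pos + err.start
            end = pos + err.end
            problems.append((start, data[start:end]))
            # Resume scanning just past the offending bytes.
            pos = end
    return problems
```

To run it on one of the attached files, read the file in binary mode first, e.g. `find_invalid_utf8(open("Kako-Francais_24421.dirty.01-26-0926.html", "rb").read())`. An empty list means the whole file decoded cleanly.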

n8marti commented 1 year ago

I don't know C++, but it seems like something might be going wrong with this function, Utils::string_html_escape: https://github.com/manisandro/gImageReader/blob/master/gtk/src/Utils.cc#L382

I've seen instances where "normal" characters get improperly replaced, e.g. "n" is written as the escaped form of "&" here, as if the text had actually contained "&" at that position:

     <span title="bbox 512 1454 846 1499; x_fsize 11; x_wconf 95" class="ocrx_word" id="word_7_208" lang="fr_BE">“Développeme&amp;t</span>
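
For comparison, this is a sketch of the behaviour one would expect from an HTML-escaping step: only the markup-significant characters are replaced, and every other character, accented or not, passes through unchanged. It uses Python's standard `html.escape` purely as an illustration. It is not gImageReader's C++ implementation, and `escape_word` is a hypothetical helper name.

```python
import html


def escape_word(text: str) -> str:
    # html.escape works on Unicode code points rather than raw bytes,
    # so a multi-byte character such as é is never split in the middle.
    # With quote=True it replaces &, <, >, " and ' and nothing else.
    return html.escape(text, quote=True)
```

With a function like this, the word in the span above would come out unchanged, since it contains none of the five special characters.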

manisandro commented 1 year ago

Thanks for the hint, should be fixed by https://github.com/manisandro/gImageReader/commit/d3a809cf4c10cb0b0392ed8e7b05e1a5824222a7

n8marti commented 1 year ago

Thanks. Will this show up as a new package in the Ubuntu PPA?

manisandro commented 1 year ago

I've submitted the PPA builds for jammy and kinetic.

n8marti commented 1 year ago

Great. Any chance you could have it build for focal, too?

manisandro commented 1 year ago

Done

manisandro / gImageReader

generated HTML file sometimes contains invalid (non UTF-8) characters #615