manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0
1.6k stars, 188 forks

gImageReader does not load the html it generates itself #605

Closed (snden closed this issue 1 year ago)

snden commented 1 year ago

At one point gImageReader crashed unexpectedly; luckily, I had been saving the HTML file. When I reloaded this file, gImageReader reported:

$ gimagereader-gtk
Entity: line 69: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xC3 0x26 0x6C 0x74
690; x_fsize 43; x_wconf 23" class="ocrx_word" id="word_6_14" lang="ces">nestač
                                                                               ^
Entity: line 61298: parser error : error parsing attribute name
x_fsize 34; x_wconf 0" class="ocrx_word" id="word_170_32" lang="ces">�&lt;�<upoy
                                                                               ^
Entity: line 61298: parser error : attributes construct error
x_fsize 34; x_wconf 0" class="ocrx_word" id="word_170_32" lang="ces">�&lt;�<upoy
                                                                               ^
Entity: line 61298: parser error : Couldn't find end of Start Tag upoy line 61298
x_fsize 34; x_wconf 0" class="ocrx_word" id="word_170_32" lang="ces">�&lt;�<upoy
                                                                               ^
Entity: line 61308: parser error : StartTag: invalid element name
60; x_fsize 16; x_wconf 0" class="ocrx_word" id="word_170_38" lang="ces">ř&lt;<
                                                                               ^

(gimagereader-gtk:3704): glibmm-ERROR **: 13:54:06.793: 
unhandled exception (type std::exception) in signal handler:
what: Document not well-formed.
Line 61308, column 121 (fatal):
StartTag: invalid element name

gImageReader 3.3.1 (), Linux Mint VERSION="20 (Ulyana)", Xfce 4.14.2

snden commented 1 year ago

When I try to open the hOCR file generated by gImageReader in another editor, it always reports that the file contains non-UTF-8 characters. If I replace these characters manually, gImageReader can read the file again. It looks like gImageReader does not write the hOCR file as valid UTF-8.
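For anyone who wants to locate the offending bytes without an editor, here is a minimal Python sketch. It only scans raw bytes for sequences that do not decode as UTF-8 and reports their offsets. The function name `find_invalid_utf8` and the file name in the usage comment are illustrative assumptions, not part of gImageReader.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, offending_bytes) for every sequence in data that is not valid UTF-8."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; success means nothing invalid is left.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice, so shift by pos.
            start = pos + err.start
            end = pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume scanning right after the bad bytes
    return problems


# Hypothetical usage (the file name is an assumption):
#
#   with open("test-image.html", "rb") as fh:
#       for offset, bad in find_invalid_utf8(fh.read()):
#           print(f"offset {offset}: {bad.hex(' ')}")
```

Replacing the reported bytes by hand, as described above, makes the file loadable again but loses whatever character was originally there.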

manisandro commented 1 year ago

I cannot reproduce this, though I see you are using 3.3.1 rather than 3.4.0. Please upgrade to 3.4.0 and reopen if you hit the issue again.

snden commented 1 year ago

I upgraded to version 3.4.0 () and unfortunately the error still appears.

$ gimagereader-gtk
Entity: line 1218: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xC3 0x26 0x71 0x75
; x_fsize 22; x_wconf 96" class="ocrx_word" id="word_1_957" lang="cs">černobíl
                                                                               ^

I am attaching a test image (test_cs). Language for OCR: Čeština [ces].

manisandro commented 1 year ago

That it is unable to open the invalid file is expected; the question is rather whether gimagereader-3.4.0 still writes invalid hOCR HTML files.

n8marti commented 1 year ago

I can confirm this problem on 3.4.0. Here's what happens when trying to open the hOCR file generated by gImageReader (test-image.html.zip): [two screenshots]

Original image shared by @snden: test-image

Zipped HTML hOCR file: test-image.html.zip

It's an improvement in that the app doesn't crash, but it still can't open the file.