Closed by snden 1 year ago
When I try to open the hOCR file generated by gImageReader in another editor, it always says that the file contains non-UTF-8 characters. If I replace these characters manually, even gImageReader will then read it. It looks like gImageReader doesn't write the hOCR file as valid UTF-8.
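For anyone who wants to automate the manual replacement described above, here is a minimal Python sketch. It assumes the file is meant to be UTF-8 and swaps each undecodable byte sequence for U+FFFD, so the original character is lost. It is only a workaround for getting the file to load, not a fix for whatever writes the bad bytes. The function name `replace_invalid_utf8` and the sample bytes are made up for illustration.

```python
def replace_invalid_utf8(data: bytes) -> bytes:
    """Return data with every invalid UTF-8 sequence replaced by U+FFFD.

    Hypothetical helper: valid text passes through unchanged, but the
    damaged characters themselves cannot be recovered this way.
    """
    text = data.decode("utf-8", errors="replace")
    return text.encode("utf-8")


# Illustrative input: a lone lead byte 0xC3 followed by ASCII text.
damaged = b"l\xc3&quot;"
repaired = replace_invalid_utf8(damaged)

# The repaired bytes now decode cleanly.
repaired.decode("utf-8")
```

To apply this to a real file, the bytes could be read and written back with `pathlib.Path(...).read_bytes()` and `write_bytes()`; keeping a copy of the original is advisable, since the replacement is lossy.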
I cannot reproduce this, though I see you are using 3.3.1 rather than 3.4.0. Please upgrade to 3.4.0 and reopen if you hit the issue again.
I upgraded to version 3.4.0 and unfortunately the error still appears.
$ gimagereader-gtk
Entity: line 1218: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xC3 0x26 0x71 0x75
; x_fsize 22; x_wconf 96" class="ocrx_word" id="word_1_957" lang="cs">černobíl
^
I am attaching a test image. Language for OCR: Čeština [ces]
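A note on the bytes in the parser message: 0xC3 is the lead byte of a two-byte UTF-8 sequence, while 0x26 0x71 0x75 are the ASCII characters `&qu`, so the lead byte is not followed by a continuation byte. The following Python sketch shows one way to locate such sequences in a saved file. It assumes nothing about gImageReader itself; the helper name `find_invalid_utf8` is hypothetical, and the sample bytes only imitate the pattern from the log above.

```python
def find_invalid_utf8(data: bytes) -> list[tuple[int, bytes]]:
    """Return (offset, offending_bytes) for each invalid UTF-8 sequence."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; success means no further problems.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice, so shift them.
            start = pos + err.start
            end = pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume scanning after the bad sequence
    return problems


# Sample imitating the log: valid Czech text, then a lone 0xC3 before "&qu".
sample = "černobíl".encode("utf-8") + b"\xc3&qu"
problems = find_invalid_utf8(sample)
for offset, bad in problems:
    print(f"invalid UTF-8 at byte offset {offset}: {bad.hex()}")
```

To check a real hOCR file, the bytes can be read with `pathlib.Path(...).read_bytes()` and passed to the function; an empty result means the file decodes as UTF-8.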
That it is unable to open the invalid file is expected; the question is rather whether gimagereader-3.4.0 still writes invalid hOCR HTML files.
I can confirm this problem on 3.4.0. Here's what happens when trying to open the hOCR file generated by gImageReader (test-image.html.zip):
Original image shared by @snden:
Zipped HTML hOCR file: test-image.html.zip
It's an improvement in that the app doesn't crash, but it still can't open the file.
At one point gImageReader crashed unexpectedly; luckily I was saving the HTML file. When I reloaded this file, gImageReader reports:
gImageReader 3.3.1, Linux Mint 20 (Ulyana), Xfce 4.14.2