manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.

[Ubuntu 20.10] big files=program hangs. heavy use of memory? #505

Open hollisticated-horse opened 3 years ago

hollisticated-horse commented 3 years ago

Hi, new ticket, different issue: it seems that with big PDFs, the software hangs or has a hard time staying stable. The "force quit or wait" dialog comes and goes... Since I don't have the technical skills to track this down and fix it myself, can I help in any way to diagnose it or provide info for debugging?

Edit: the original file is a ~144 MB .pdf, and the generated .html is about the same size: 1097 pages, full text, and images...

hollisticated-horse commented 3 years ago

It seems to use quite a bit of memory... Could the output be written to a temporary file on the go instead, to avoid hogging memory? Usage went from 5 to 8+ GB on a 1000-page PDF.
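A minimal sketch of that temp-file idea, assuming a hypothetical HocrSpillBuffer class and a fixed temporary path; this is not gImageReader's actual code, just an illustration of appending each page's hOCR to disk instead of holding the whole document in RAM:

#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical sketch: spill each recognized page's hOCR to a temporary file
// as soon as it is produced, so memory use stays roughly flat regardless of
// how many pages the PDF has.
class HocrSpillBuffer {                                   // hypothetical name
public:
    HocrSpillBuffer()
        : m_path("/tmp/gimagereader-hocr-spill.html")     // assumed temp location
        , m_out(m_path, std::ios::trunc) {}
    ~HocrSpillBuffer() { m_out.close(); std::remove(m_path.c_str()); }

    // Called once per recognized page; nothing accumulates in memory.
    void appendPage(const std::string& pageHocr) {
        m_out << pageHocr << '\n';
        m_out.flush();
    }

    // Exporters would read this file back instead of a giant in-memory string.
    const std::string& path() const { return m_path; }

private:
    std::string m_path;
    std::ofstream m_out;
};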

hollisticated-horse commented 3 years ago

It finished loading, and I was able to save the hOCR output in a .html file. But now it doesn't want to load it anymore. Running gimagereader-gtk --gtk-debug=FLAGS dumps this when I try to open the generated .html file:

Bytes: 0xE2 0x80 0x26 0x71
fsize 10; x_wconf 76" class="ocrx_word" id="word_34_119" lang="eng">“inherited
                                                                               ^
Entity: line 20949: parser error : EntityRef: expecting ';'
 x_fsize 9; x_wconf 35" class="ocrx_word" id="word_35_244" lang="eng">(�&�&�
                                                                               ^
Entity: line 962995: parser error : EntityRef: expecting ';'
 x_fsize 9; x_wconf 56" class="ocrx_word" id="word_824_393" lang="eng">�&�&�
                                                                               ^

(gimagereader-gtk:277463): glibmm-ERROR **: 19:08:03.778: 
unhandled exception (type std::exception) in signal handler:
what: 
Validity error:
Line 1028405, column 131 (error):
xmlSAX2Characters: huge text node

Trace/breakpoint trap (core dumped)

Is the text so huge that it can't handle it, or is there an encoding error?
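Most likely both. The "EntityRef: expecting ';'" lines point to a bare '&' in the recognized text that was not escaped as &amp;, while the "xmlSAX2Characters: huge text node" abort comes from libxml2 refusing a single text node larger than its default limit (about 10 MB) unless the document is parsed with the XML_PARSE_HUGE option. A rough sketch of both points, using plain libxml2 calls rather than whatever gImageReader actually does internally (the helper name is made up for illustration):

#include <libxml/parser.h>
#include <string>

// Naive sketch: escape bare ampersands before writing hOCR/XHTML output.
// (A real implementation would have to avoid double-escaping existing entities.)
std::string escapeAmpersands(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        if (c == '&') out += "&amp;";
        else          out += c;
    }
    return out;
}

// Parse a very large hOCR file with libxml2's size limits relaxed.
xmlDocPtr parseLargeHocr(const char* path) {
    // XML_PARSE_HUGE lifts libxml2's hardcoded limits (including the text-node
    // size check); XML_PARSE_RECOVER lets the parser continue past recoverable
    // errors such as a stray unescaped '&'.
    return xmlReadFile(path, nullptr, XML_PARSE_HUGE | XML_PARSE_RECOVER);
}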

hollisticated-horse commented 3 years ago

Got an outright crash when exporting to ODT: backtrace_gimage.txt, backtrace_gimage 2.txt

Then another crash when importing the original .xml file generated by gImageReader:

import_xml_backtrack_gimage.txt

What should I do?

hollisticated-horse commented 3 years ago

Have I found the limit?