manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0

Corrupted PDF file export with hOCR data #486

Closed tukusejssirs closed 3 years ago

tukusejssirs commented 3 years ago

@manisandro, I believe it is either not yet fixed or not fully fixed.

I installed gIR v3.3.1 (GTK) from the Fedora repos and loaded 846 images (mostly text, three simple tables, plus the front and back covers as images). All images were processed in ST Advanced at 600 dpi. A typical image is around 500 KiB; five images are between 1.2 and 2.9 MiB, and the covers are 16 MiB and 36 MiB respectively. Altogether, they total 251 MiB.

gIR crashes a lot, and loading the hOCR HTML takes some time (this computer has 10 GB of RAM and 2 cores with 4 threads).

However, sometimes it works (albeit slowly). I exported the PDF file, but both Evince and Adobe Acrobat Reader (both on Linux) fail to open it and report that it is corrupted/damaged.
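For anyone trying to narrow down what "corrupted" means here, a shallow structural check of the exported file can help tell a truncated export from a file that is broken in some other way. The following is only a minimal sketch: the function name `quick_pdf_check` and the 1024-byte tail window are assumptions made for illustration, and passing this check does not prove that a PDF is valid.

```python
from pathlib import Path


def quick_pdf_check(path, tail_bytes=1024):
    """Return a list of problems found by a shallow structural check.

    Hypothetical helper: it only looks for the "%PDF-" header at the
    start of the file and for the "startxref" keyword and "%%EOF"
    marker near the end. An empty list means nothing obvious was found.
    """
    data = Path(path).read_bytes()
    if not data:
        return ["file is empty"]

    problems = []
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header at start of file")

    # A file that was cut off mid-write usually lacks the trailer.
    tail = data[-tail_bytes:]
    if b"startxref" not in tail:
        problems.append("no startxref keyword near end of file")
    if b"%%EOF" not in tail:
        problems.append("no %%EOF marker near end of file (possibly truncated)")
    return problems
```

If the list mentions a missing `%%EOF` marker, the export was probably interrupted (for example by a crash); if the list is empty, the damage is more likely inside the file body.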

I have no idea how to generate a PDF file with hOCR data. I’ve read about the hocr-tools package and its hocr-pdf command, and about hocr2pdf from the exactImage package, but I could not produce such a PDF with either (whether from image files or a pre-created PDF file plus the hOCR data).

Originally posted by @tukusejssirs in https://github.com/manisandro/gImageReader/issues/424#issuecomment-758819893

tukusejssirs commented 3 years ago

@manisandro, you requested some traces. I have none, nor can I reproduce the crash on demand. However, I do have some core dumps. The 41 dumps I have from 3 January 2021 take up 13 GiB, so I compressed only the text that the coredumpctl [pid] command outputs. You could check them all, but I suggest starting with the latest (number 40). If you find something, tell me the number and I’ll upload the dump somewhere.

Note that not all core dumps are related to this issue. Some crashes are related to #479 (in general, to HTML entities, especially the ampersand character), others to the image data or to something else I couldn’t figure out.
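To separate the entity-related crashes from the rest, one option is to scan the hOCR HTML for ampersands that are not part of a character reference before loading it. This is a rough sketch under the assumption that bare `&` characters are what triggers the problem described in #479, which this thread does not confirm; the names `BARE_AMP` and `find_bare_ampersands` are made up for illustration.

```python
import re

# Matches "&" that does NOT start a named (&amp;), decimal (&#38;)
# or hexadecimal (&#x26;) character reference.
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def find_bare_ampersands(hocr_text):
    """Return (line, column) pairs, 1-based, for each bare ampersand."""
    hits = []
    for lineno, line in enumerate(hocr_text.splitlines(), start=1):
        for match in BARE_AMP.finditer(line):
            hits.append((lineno, match.start() + 1))
    return hits
```

A page whose hOCR yields no hits here but still crashes gIR would point towards the image data or another cause.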

dump_texts.zip

manisandro commented 3 years ago

This looks like a crash in Gtk - if you run the application from the terminal, do you see any output when this happens?
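One way to capture that terminal output so it can be attached to the issue is to start the program through a small wrapper that writes stdout and stderr to a log file. The sketch below assumes nothing about gImageReader itself; `run_and_capture` is a hypothetical helper, and the executable name in the usage note is an assumption that may differ between distributions.

```python
import subprocess


def run_and_capture(cmd, log_path):
    """Run cmd, write its stdout and stderr to log_path, return the exit code.

    On POSIX systems a negative return value means the process was
    terminated by a signal (for example -11 for SIGSEGV).
    """
    with open(log_path, "wb") as log:
        proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode
```

For example, `run_and_capture(["gimagereader-gtk"], "gir.log")` would leave any Gtk warnings printed before the crash in `gir.log` (the name `gimagereader-gtk` is assumed here).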