bruzzler5 opened 3 months ago
I did some tests trying to work around the problem. This was possible by first exporting the hOCR file, running a Perl script that does the search & replace in the hOCR file, and then re-importing the resulting hOCR file. This procedure needed only a very small fraction of the time compared with the existing search & replace routine, AND the exported PDF contained the replaced chars/strings!
I didn't check which algorithm gImageReader implements, but at least for huge numbers of images, IMHO an approach like the one described above would be worth implementing.
BTW, this workaround shows that gImageReader (obviously thanks to the PoDoFo library) is already able to create readable PDFs from an hOCR file plus images, a task many programs fail at, e.g. hocr-tools and many others. I'd suggest promoting this feature after thorough testing.
I do OCR on German Fraktur newspapers. A PDF of 1-2 GB contains approx. 1000 pages. OCR is done overnight (i5-3570), that's OK. But the search & replace of 3 specific Fraktur characters (the Fraktur s, the Fraktur hyphen, and the Fraktur long hyphen) with the corresponding normal characters takes forever: after 4 hours I aborted the program.
It seems that the algorithm used isn't efficient enough for long files and could/should be improved. BTW, if I use a bypass (exporting the hOCR, replacing the 3 characters e.g. with a Perl script, and re-importing the corrected hOCR), it is done in minutes.
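For illustration, the hOCR-side replacement step of this bypass can be sketched in a few lines. This is only a minimal sketch, not the reporter's actual Perl script; the exact three characters are an assumption here (the long s U+017F, the double oblique hyphen U+2E17 common in Fraktur typesetting, and a plain Unicode hyphen U+2010 as a hypothetical third), so adjust the table to whatever Tesseract actually emits:

```python
from pathlib import Path

# Assumed character mapping; replace with the codepoints your OCR output contains.
REPLACEMENTS = {
    "\u017F": "s",   # ſ  (long s)                 -> s
    "\u2E17": "-",   # ⸗  (double oblique hyphen)  -> -
    "\u2010": "-",   # ‐  (hyphen, assumed third)  -> -
}

def fix_hocr(src: Path, dst: Path) -> None:
    """Run a character-level search & replace over an exported hOCR file.

    str.translate does a single pass over the text, so even a ~1000-page
    hOCR file is processed in seconds rather than hours.
    """
    table = str.maketrans(REPLACEMENTS)
    dst.write_text(
        src.read_text(encoding="utf-8").translate(table),
        encoding="utf-8",
    )
```

Since the replacements are single characters, a one-pass `str.translate` is enough; string patterns would instead need `re.sub` or repeated `str.replace` calls, which is still linear in the file size and far cheaper than a per-match document update.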
BTW: any chance of implementing a PDF export mode in which the image format remains unchanged? Any format conversion results in much larger files.