tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Search&Replace in hOCR #4289

Closed bruzzler5 closed 2 months ago

bruzzler5 commented 2 months ago

Your Feature Request

Hello,

I'm using Tesseract to OCR scans of German Fraktur newspapers and Fraktur documents from NARA. The scans are usually 1-2 GB in size, and OCR with Tesseract typically runs 3-4 hours on my desktop machine, which is fine. After OCR I do a search and replace that maps some specific Fraktur letters to the usual Latin letters, together with some spelling corrections. This search and replace takes a very long time.

a) Is there any chance of getting a faster search and replace implemented in Tesseract?

b) Could I export the uncorrected hOCR, do a search and replace in the hOCR (e.g. with a Perl script), and then re-import the hOCR without corrupting the final PDF product? IMHO, if the replacement string has almost the same length as the searched string, the box size should stay the same?

zdenop commented 2 months ago

a) No. Tesseract is an OCR engine; for postprocessing you need to use other tools.

b) hOCR is Tesseract output. You can postprocess it and then use other external tools to embed it into a PDF.
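
A rough sketch of such a chain (illustrative only: the file names, the frak2021 model name, and the hocr-pdf tool from the hocr-tools package are assumptions, not something Tesseract itself provides):

    # 1. OCR one page to hOCR (writes page-0001.hocr)
    tesseract page-0001.jpg page-0001 -l frak2021 hocr

    # 2. Character replacement in the hOCR; the bounding boxes sit in the
    #    numeric "title" attributes, so editing the recognized text does
    #    not touch them. Long s (U+017F) -> s is just an example; -CSD
    #    makes Perl read and write the files as UTF-8.
    perl -CSD -pi -e 's/\x{017F}/s/g' page-0001.hocr

    # 3. Embed the corrected hOCR into a searchable PDF with an external
    #    tool, e.g. hocr-pdf from hocr-tools, which pairs .jpg images with
    #    same-named .hocr files in a directory (check its docs).
    hocr-pdf . > newspaper.pdf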

stweil commented 2 months ago

@bruzzler5, I also use Tesseract for German newspapers with Fraktur script. Maybe some hints on https://ocr-bw.bib.uni-mannheim.de/faq/ might be helpful for you. My latest Tesseract models are available from https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/german_print_20231218.

But why are your scans so large? Or is 2 GB the total size of many page scans?

We use the OCR ALTO XML results without any postprocessing. Search and replace could be done with perl -pi -e [...] which is very fast.
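
For example (purely illustrative; which characters you map depends on the material, here the long s and the double oblique hyphen):

    perl -CSD -pi -e 's/\x{017F}/s/g; s/\x{2E17}/-/g' *.xml

-CSD makes Perl read and write the files as UTF-8.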

bruzzler5 commented 2 months ago

Thanks for your replies.

1-2 GB is the size of approx. 1000 scanned pages (8-bit RGB) of a local pre-WW2 newspaper we're currently OCRing. We are using the frak2021_1.069 model from Uni Mannheim. The recognition is equal to or even better than that of the OCR engine that https://digitale-sammlungen.de uses for Fraktur.

Last night, Tesseract/gImageReader needed approx. 10 hours to OCR those approx. 1000 pages in a 1.3 GB file. But replacing 3 characters (Fraktur s, Fraktur hyphens) is still running (>4 h) in gImageReader; the algorithm it uses seems odd. It seems that gImageReader is the problem.

So, after OCRing with Tesseract, I'll switch to a Perl script and do the post-processing with other hOCR tools.
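
Something along these lines is what I have in mind (just a sketch; the character mappings and the script name fix_fraktur.pl are only examples):

    #!/usr/bin/perl
    # Sketch only: replace a few Fraktur characters in hOCR files in place.
    # The mapped characters are examples; adjust them to your newspaper.
    use strict;
    use warnings;

    foreach my $file (@ARGV) {
        open my $in, '<:encoding(UTF-8)', $file or die "Cannot read $file: $!";
        my $text = do { local $/; <$in> };   # slurp the whole file
        close $in;

        $text =~ s/\x{017F}/s/g;             # long s (U+017F) -> s
        $text =~ s/\x{2E17}/-/g;             # double oblique hyphen (U+2E17) -> -

        open my $out, '>:encoding(UTF-8)', $file or die "Cannot write $file: $!";
        print {$out} $text;
        close $out;
    }

Called as perl fix_fraktur.pl *.hocr it rewrites the files in place; the corrected hOCR then goes to the external tools for PDF embedding.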