manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.

GNU General Public License v3.0

1.6k stars, 188 forks

OCR taking too long #578

Closed vivadavid closed 1 year ago

vivadavid commented 2 years ago

Hi!

I've performed OCR on a book consisting of 354 PNG images (the originals were in JP2, but I converted them because the programme crashed every time). This is the source:

https://archive.org/details/19261928Liberacin

My settings:

Mode: hOCR, PDF.
Language: Spanish [spa] es.
Segmentation: automatic.

It took around 28-29 minutes.

I did the same thing with VietOCR and it took less than 8 minutes.

I wanted to report it in case I did something wrong or in case there is a bug.

Thank you for your time! I love your programme!

manisandro commented 1 year ago

Sorry for the late reply.

See https://github.com/tesseract-ocr/tesseract/issues/1662. If your Tesseract is built with OpenMP support, this is likely the reason. You should either rebuild Tesseract with OpenMP support disabled (as is the upstream default and recommendation), or set the OMP_THREAD_LIMIT=1 environment variable before launching gImageReader, for example on Linux with gimagereader-qt5:

$ OMP_THREAD_LIMIT=1 gimagereader-qt5

vivadavid commented 1 year ago

Hi, thanks for your reply! It looks a bit complicated, and the language packages I use for Tesseract are downloaded through gImageReader anyway. Is it something that could be fixed or adjusted in a future release of your programme?

manisandro commented 1 year ago

If you are using the latest 3.4.1 Windows build, the bundled tesseract is compiled without OpenMP support, so it should not suffer from the performance penalty.

vivadavid commented 1 year ago

I've just tried version 3.4.1 and, from 28-29 minutes, this time the OCR process took around 5 minutes 30 seconds, so that's great! Thanks!

I suppose I should open a new thread, but as I described in my first message, I keep getting an error message whenever I want to import JP2 images. Isn't this format supported?

[Screenshot: gImageReader - 000029]

manisandro commented 1 year ago

Looks like a crash in the Jasper JP2 library - can you share the image which triggers this?

vivadavid commented 1 year ago

There you go:

jp2_file.zip

vivadavid commented 1 year ago

Hi, @manisandro, just a quick message to let you know that I've just tried the OCR tool in PDF24 and my JP2 images weren't supported either. It must be a general issue.

manisandro commented 1 year ago

I see that there is an assertion error in the Jasper JP2 image library which triggers the crash. I haven't had the time to debug it further though.

danpla commented 1 year ago

@manisandro A small tip. On Unix-like systems, you can do the OMP_THREAD_LIMIT workaround right from the executable via setenv followed by execvp somewhere at the beginning of main() (example).
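
For readers following along, the setenv-then-execvp idea could look roughly like the sketch below. It is only an illustration: the helper name ensure_omp_thread_limit is hypothetical and not taken from gImageReader or from the linked example. The re-exec is presumably needed because the OpenMP runtime may read OMP_THREAD_LIMIT when it initializes, before main() gets a chance to change it. The sketch also assumes that a value already set by the user should be left alone, which doubles as the guard against an endless re-exec loop.

```c
#define _POSIX_C_SOURCE 200112L

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper (not from gImageReader's source).
 *
 * Returns 0 when OMP_THREAD_LIMIT is already present in the environment,
 * in which case nothing is changed. Otherwise it sets the variable to "1"
 * and replaces the current process with a fresh copy of the same program,
 * so the new copy starts with the variable in place. When the re-exec
 * succeeds this function never returns; -1 means setenv or execvp failed. */
int ensure_omp_thread_limit(char *const argv[])
{
    if (getenv("OMP_THREAD_LIMIT") != NULL) {
        /* Already set (by the user or by our own earlier re-exec). */
        return 0;
    }

    if (setenv("OMP_THREAD_LIMIT", "1", 1) != 0) {
        perror("setenv");
        return -1;
    }

    /* execvp searches PATH when argv[0] contains no slash. */
    execvp(argv[0], argv);

    /* Only reached if execvp failed. */
    perror("execvp");
    return -1;
}
```

Calling ensure_omp_thread_limit(argv) as the first statement of main() would apply the workaround without the user having to touch the shell. Note that it limits every OpenMP user in the process, not just Tesseract.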

manisandro commented 1 year ago

I could also limit the number of threads via the OpenMP API, but I'd rather not, as there are other parts in gImageReader which truly benefit from parallelism, so the proper solution really is to ensure that Tesseract is built properly.