opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0

Maximum concurrent Tesseract jobs? #110

Closed NetwarSystem closed 4 years ago

NetwarSystem commented 4 years ago

I set up an OSS VM for testing, giving it ten cores and 16 GB of memory. I placed 600 screenshots, mostly from Twitter, in a directory and started indexing them. The system used all 16 GB of memory, filled swap, and then kept going, but terribly slowly. I shut it down, upgraded the VM to 24 GB, and restarted. I now see 104 concurrent Tesseract sessions, and memory usage is climbing in steady 100 MB increments every few seconds.

Is there a way to limit the maximum number of processes for Tika, which is what invokes Tesseract?

opensemanticsearch commented 4 years ago

I implemented a setting for the maximum number of parallel ETL tasks (it will be available in the next DEB package within the next few days): the environment variable OPEN_SEMANTIC_ETL_CONCURRENCY. If it is not set, the default is the number of CPUs.
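As a rough illustration of that fallback behaviour, here is a minimal sketch. The helper name `etl_concurrency` is hypothetical and not taken from the project; only the variable name OPEN_SEMANTIC_ETL_CONCURRENCY and the CPU-count default come from the comment above, and the handling of invalid values is an assumption.

```python
import os


def etl_concurrency(environ=None):
    """Return the maximum number of parallel ETL tasks.

    Hypothetical helper: reads OPEN_SEMANTIC_ETL_CONCURRENCY and falls
    back to the CPU count when the variable is unset or not a positive
    integer (the invalid-value handling is an assumption).
    """
    env = os.environ if environ is None else environ
    default = os.cpu_count() or 1  # cpu_count() may return None
    raw = env.get("OPEN_SEMANTIC_ETL_CONCURRENCY", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default
```

With the variable set to `4`, this sketch returns 4; with it unset, it returns the number of CPUs reported by the operating system.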

If all files processed in parallel need OCR, each parallel ETL task calls Tesseract, and each Tesseract call opens up to 4 threads.

If you have 8 CPUs, then without fine-tuning (or the next release) there are up to 8 parallel ETL tasks with potentially 8 * 4 = 32 Tesseract threads.

As a next step I will limit Tesseract calls to one thread per parallel task, so there will be at most as many Tesseract threads as there are CPUs / ETL tasks.
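The arithmetic can be written out as a small sketch. The function name `max_tesseract_threads` is made up for illustration; the numbers are the ones from this comment (8 ETL tasks, up to 4 threads per Tesseract call now, 1 thread per call after the planned change).

```python
def max_tesseract_threads(etl_tasks, threads_per_call):
    """Worst case: every parallel ETL task is running one Tesseract call."""
    return etl_tasks * threads_per_call


# Current behaviour: 8 parallel tasks, up to 4 threads per Tesseract call.
print(max_tesseract_threads(8, 4))  # 32

# Planned behaviour: 8 parallel tasks, 1 thread per Tesseract call.
print(max_tesseract_threads(8, 1))  # 8
```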

From Tesseract FAQ https://github.com/tesseract-ocr/tesseract/wiki/FAQ#can-i-increase-speed-of-ocr:

Tesseract 4 also uses up to four CPU threads while processing a page, so it will be faster than Tesseract 3 for a single page.

If your computer has only two CPU cores, then running four threads will slow down things significantly and it would be better to use a single thread or maybe a maximum of two threads! Using a single thread eliminates the computation overhead of multithreading and is also the best solution for processing lots of images by running one Tesseract process per CPU core.

Set the maximum number of threads using the environment variable OMP_THREAD_LIMIT.

To disable multithreading, use OMP_THREAD_LIMIT=1
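For anyone calling Tesseract from their own Python code, a minimal sketch of applying this setting could look like the following. The helper names `single_thread_env` and `ocr_image` are hypothetical; this is not how Tika or Open Semantic ETL invoke Tesseract, only an illustration of passing OMP_THREAD_LIMIT=1 to a child process. It assumes the `tesseract` binary is on the PATH.

```python
import os
import subprocess


def single_thread_env(base_env=None):
    """Return a copy of the environment with Tesseract multithreading disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def ocr_image(image_path, output_base):
    """Run the tesseract CLI on one image using a single thread.

    Uses the standard CLI form: tesseract <imagename> <outputbase>.
    Raises CalledProcessError if tesseract exits with an error.
    """
    subprocess.run(
        ["tesseract", image_path, output_base],
        env=single_thread_env(),
        check=True,
    )
```

Calling `ocr_image("shot.png", "shot")` would write the recognized text to `shot.txt`. Because the variable is set only in the child's environment, other processes are not affected.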

Mandalka commented 4 years ago

I disabled Tesseract multithreading in the default config for Tika-Server in https://github.com/opensemanticsearch/tika-server.deb/commit/9d6cbb9b3773ad03fceeaabef5a63766e4271230 and for OCR of PDFs in https://github.com/opensemanticsearch/open-semantic-etl/commit/f01c07b7f36984d4190fcdf916d6bcbade92a0fe

It will be available as a new package and VM tomorrow.