opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0
254 stars, 69 forks

Disable OCR / tesseract #125

Closed: nevermind2001 closed this issue 4 years ago

nevermind2001 commented 4 years ago

Hey,

I installed Open Semantic Search in an Ubuntu container and added a mounted network folder with all my PDFs. They are already OCRed. I think I disabled OCR in Open Semantic Search, but as soon as I start the VM I have up to 4 tesseract jobs running and the VM has 100% CPU usage. What is this and how can I avoid it?

Maybe I have to disable OCR in Tika? I googled a lot but I don't understand how to do it. It would be great if someone could help me.

Thanks a lot


Mandalka commented 4 years ago

If you installed the full search engine (the all-in-one package):

Click on "Config", select the "OCR" tab, then disable both "OCR" and "OCR images in PDF".

Since you posted in the repository of the ETL package:

If you use only the ETL package, you can disable OCR and the plugin enhance_pdf_ocr in /etc/opensemanticsearch/etl. (Don't do that if the full search engine is installed, since the file will be overwritten by the config UI.)
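
As a rough illustration of what that change amounts to, here is a minimal sketch in Python. It assumes the ETL settings behave like a dictionary with an `ocr` flag and a `plugins` list; those two key names, the helper `disable_ocr`, and the placeholder plugin `some_other_plugin` are assumptions made for this example and are not taken from the actual config file. Only the plugin name `enhance_pdf_ocr` comes from the comment above, so check your own /etc/opensemanticsearch/etl for the real option names.

```python
def disable_ocr(config):
    """Return a copy of the settings with OCR turned off.

    Assumption: the settings are a dict with an 'ocr' flag and a
    'plugins' list. Both key names are hypothetical.
    """
    updated = dict(config)
    # Switch off the general OCR flag (assumed key name).
    updated["ocr"] = False
    # Drop the PDF OCR plugin, keeping all other plugins in order.
    updated["plugins"] = [
        name for name in config.get("plugins", [])
        if name != "enhance_pdf_ocr"
    ]
    return updated


# Hypothetical starting settings, for illustration only.
config = {
    "ocr": True,
    "plugins": ["some_other_plugin", "enhance_pdf_ocr"],
}

config = disable_ocr(config)
print(config)
```

The helper returns a new dictionary instead of modifying the input, so the original settings stay available for comparison.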