opensemanticsearch / open-semantic-etl

Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction and named entity recognition), and data enrichment (annotation) pipelines, with an ingestor to a Solr or Elasticsearch index and a linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0
254 stars 69 forks

If OCR runs in a later stage, only start (re)processing with OCR enabled if the document has (embedded) images #116

Closed: opensemanticsearch closed this issue 4 years ago

opensemanticsearch commented 4 years ago

For better performance and to avoid unnecessary tasks: if OCR runs in a later stage, only add the task that starts (re)processing with OCR if the content type is an image or the document contains embedded image(s).
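The check described above could look roughly like the following. This is only a minimal sketch, not the project's actual code: the function name `needs_ocr_task` and the `embedded_images` counter are hypothetical, and it assumes the content type is available as a MIME type string.

```python
def needs_ocr_task(content_type, embedded_images=0):
    """Decide whether a later OCR (re)processing task is worth queuing.

    Hypothetical helper: `content_type` is assumed to be a MIME type string
    (or None), `embedded_images` the number of images found in the document.
    """
    # Image files are always OCR candidates.
    if (content_type or "").lower().startswith("image/"):
        return True
    # Other documents only need OCR if they embed at least one image.
    return embedded_images > 0
```

With this, a plain text file or a PDF without embedded images would not trigger a second processing run, while a scanned PDF would.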

Mandalka commented 4 years ago

Implemented a fake Tesseract CLI wrapper in the repo tesseract-ocr-cache (which is now a submodule of Open Semantic ETL), so we get more status information before the real OCR runs.

The plugin enhance_extract_tika_server uses this status to set the OCR status and disable further OCR plugins.
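To illustrate the idea, here is a rough sketch of how a fake Tesseract wrapper and a status check could fit together. It is not the code from tesseract-ocr-cache or enhance_extract_tika_server: the function names, the status file `ocr_requested.json`, and the plugin name `hypothetical_ocr_plugin` are all assumptions made for this example. It only assumes the usual `tesseract <image> <outputbase>` argument order.

```python
import json
from pathlib import Path


def fake_tesseract(argv, status_dir):
    """Stand-in for `tesseract <image> <outputbase> ...` (hypothetical).

    Runs no OCR. It writes an empty result file so the caller finds its
    expected output, and records that OCR was requested for this image.
    """
    image, output_base = argv[0], argv[1]
    Path(output_base + ".txt").write_text("")

    status_file = Path(status_dir) / "ocr_requested.json"
    requested = json.loads(status_file.read_text()) if status_file.exists() else []
    requested.append(image)
    status_file.write_text(json.dumps(requested))
    return 0  # exit code a real CLI wrapper would return


def select_plugins(plugins, status_dir, ocr_plugins=("hypothetical_ocr_plugin",)):
    """Keep OCR plugins only if the fake wrapper recorded an OCR request."""
    status_file = Path(status_dir) / "ocr_requested.json"
    if status_file.exists() and json.loads(status_file.read_text()):
        return list(plugins)
    # No image reached the wrapper, so further OCR plugins are skipped.
    return [p for p in plugins if p not in ocr_plugins]
```

The point of the design is that the cheap first pass tells us whether any image would have been sent to OCR at all, so the expensive pass only runs for documents where it can produce text.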