opensemanticsearch closed 4 years ago
Implemented a fake Tesseract CLI wrapper in the tesseract-ocr-cache repo (which is now a submodule of Open Semantic ETL), so we get more status information before the real OCR runs.
The enhance_extract_tika_server plugin uses this status to set the status and disable further OCR plugins.
For better performance and to prevent unnecessary tasks: if OCR runs in a later stage, a task to start (re)processing / OCR is only added if the content type is an image or the document contains embedded image(s).
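The check for whether a later-stage OCR task is needed could look roughly like the sketch below. This is only an illustration of the rule described above, not the actual Open Semantic ETL code: the function name `needs_ocr_task` and the `embedded_images` parameter are hypothetical, and it assumes the content type is available as a MIME type string.

```python
def needs_ocr_task(content_type, embedded_images=0):
    """Return True if a later-stage (re)process / OCR task should be queued.

    Hypothetical helper, not part of Open Semantic ETL:
    content_type is assumed to be a MIME type string (or None),
    embedded_images is assumed to be the number of images found
    inside the document during extraction.
    """
    # An image file always needs OCR to get any text at all.
    is_image = bool(content_type) and content_type.lower().startswith("image/")

    # Other documents only need OCR if they embed at least one image.
    return is_image or embedded_images > 0


if __name__ == "__main__":
    print(needs_ocr_task("image/png"))
    print(needs_ocr_task("application/pdf", embedded_images=2))
    print(needs_ocr_task("text/plain"))
```

With a check like this, plain text documents and documents without embedded images never get an OCR task queued, which is where the saving comes from.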