Do automatic text recognition (OCR) for images later and in the background - Githubissues

opensemanticsearch / open-semantic-search

Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)

https://opensemanticsearch.org

GNU General Public License v3.0

942 stars, 164 forks

Do automatic text recognition (OCR) for images later and in the background #178

Open Mandalka opened 5 years ago

Mandalka commented 5 years ago

Run automatic text recognition (OCR) for (embedded) images later and in the background, because it costs a lot of analysis time for little additional text. That way most other data/documents, or most parts of the documents, become searchable many times faster.

Mandalka commented 5 years ago

Implemented an additional file indexing queue with lower priority that runs additional plugins like OCR. It is enabled by setting additional_plugins_later, so reindexing with additional plugins like OCR is done later, after all documents have first been indexed faster without OCR.

Todo: UI option for that in the Web Admin config UI, so there is no need to edit the ETL config in an editor.
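The two-pass idea from this comment can be sketched as follows. This is a minimal illustration and not the project's actual code: only the setting name additional_plugins_later comes from the thread, while the plugin names, the queue layout and the helper functions are assumptions made for the example.

```python
import heapq
import itertools

# Hypothetical config: only the key 'additional_plugins_later' is taken
# from the thread, the plugin names are placeholders.
config = {
    "plugins": ["extract_text"],
    "additional_plugins_later": ["ocr"],
}

HIGH, LOW = 0, 1  # the smaller number is served first
_counter = itertools.count()  # tie-breaker, keeps FIFO order within a priority


def enqueue_file(queue, filename, config):
    """Queue a fast pass now and, if configured, a slower pass with lower priority."""
    heapq.heappush(queue, (HIGH, next(_counter), filename, list(config["plugins"])))
    later = config.get("additional_plugins_later", [])
    if later:
        plugins = list(config["plugins"]) + list(later)
        heapq.heappush(queue, (LOW, next(_counter), filename, plugins))


def drain(queue):
    """Return the tasks in the order a single worker would process them."""
    order = []
    while queue:
        _priority, _seq, filename, plugins = heapq.heappop(queue)
        order.append((filename, plugins))
    return order


queue = []
for name in ["a.pdf", "b.png"]:
    enqueue_file(queue, name, config)
processing_order = drain(queue)
```

In this sketch every file gets its fast pass before any file gets the slow OCR pass, which is the behaviour the comment describes.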

Mandalka commented 5 years ago

Implemented the config option additional_plugins_later_config, so we can not only add additional plugins but also reconfigure plugins that have already run.

So we can disable Tika's OCR option on the first run and enable Tika's OCR option on the second run of the same plugin, too.
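One way to picture this is a config overlay for the second run. The sketch below is an assumption about how such an override could work, not the ETL's real implementation: the keys additional_plugins_later and additional_plugins_later_config are taken from the thread, while the "ocr" switch, the "tika" plugin name and the function config_for_second_run are hypothetical.

```python
import copy


def config_for_second_run(config):
    """Derive the config of the later, low-priority run from the first-run config."""
    second = copy.deepcopy(config)  # leave the first-run config untouched
    extra_plugins = second.pop("additional_plugins_later", [])
    overrides = second.pop("additional_plugins_later_config", {})
    second["plugins"] = second["plugins"] + extra_plugins
    second.update(overrides)  # reconfigure plugins that already ran
    return second


# Hypothetical values: 'ocr' stands in for Tika's OCR option.
first_run = {
    "plugins": ["tika"],
    "ocr": False,
    "additional_plugins_later": [],
    "additional_plugins_later_config": {"ocr": True},
}
second_run = config_for_second_run(first_run)
```

Here the same plugin runs twice: without OCR on the first pass and with OCR on the second, without adding any new plugin.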

Mandalka commented 5 years ago

Added a UI option for this to the Web Admin config UI.

Mandalka commented 5 years ago

The REST API for file indexing (used by file monitoring) now uses index_filedirectory for single files, too, which adds the file to multiple queues with different priorities if an option like OCR later is on.
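As a rough sketch of why one entry point can serve both cases: if a single file and a directory are expanded by the same helper, both get the same queueing logic. The function names files_under and tasks_for and the queue names "fast" and "ocr_later" are hypothetical and are not the signature of index_filedirectory.

```python
import os
import tempfile


def files_under(path):
    """Return a sorted list of files for a single file or a directory tree."""
    if os.path.isfile(path):
        return [path]
    found = []
    for root, _dirs, names in os.walk(path):
        for name in names:
            found.append(os.path.join(root, name))
    return sorted(found)


def tasks_for(path, ocr_later=True):
    """Build (queue_name, filename) tasks; the queue names are hypothetical."""
    tasks = []
    for filename in files_under(path):
        tasks.append(("fast", filename))
        if ocr_later:
            tasks.append(("ocr_later", filename))
    return tasks


# Usage: a single file and its parent directory produce the same tasks.
with tempfile.TemporaryDirectory() as tmp:
    single = os.path.join(tmp, "scan.png")
    open(single, "w").close()
    single_tasks = tasks_for(single)
    dir_tasks = tasks_for(tmp)
    without_ocr = tasks_for(single, ocr_later=False)
```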

Mandalka commented 4 years ago

Todo: OCR by the Web page importer.

Mandalka commented 4 years ago

Implemented a UI to prioritize certain files for OCR, see https://github.com/opensemanticsearch/open-semantic-search/issues/251