opensemanticsearch / open-semantic-search

Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)
https://opensemanticsearch.org
GNU General Public License v3.0

'enhance_extract_text_tika_server' error message #357

Open RabbitJackTrade opened 3 years ago

RabbitJackTrade commented 3 years ago

Newbie here, so please pardon if I'm missing something:

I'm running the VM in Oracle VirtualBox under Windows 10 (both at their current versions).

I tried indexing a file (always a Microsoft Word document) through the browser (search-apps/files/create). The response I get is

File or directory added to queue.

The file name shows up in the Newest documents tab, but the content is never indexed.

Trying the same thing from the CLI

opensemanticsearch-index-dir /path/to/filename

gets this response

Indexing new file: /path/to/filename

but the indexing never takes place. When I run this again, the response this time is

Repeating indexing of unchanged file because critical plugin(s) ['enhance_extract_text_tika_server'] failed in former run: /path/to/filename

or, on occasion

Repeating indexing of unchanged file because (additional configured) plugin(s) or options ['enhance_extract_text_tika_server_ocr_enabled'] not runned yet: /path/to/filename
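For anyone triaging similar output: both messages name the affected plugin or option inside a bracketed list, so the names can be pulled out of a captured log mechanically. The following is a minimal sketch, assuming the message wording is exactly as quoted above; `failed_plugins` is a hypothetical helper written for illustration, not part of Open Semantic Search.

```python
import re

# Assumption: the indexer prints the plugin/option names as a Python-style
# list right after "plugin(s)" or "plugin(s) or options", as in the two
# messages quoted in this issue.
PLUGIN_LIST = re.compile(r"plugin\(s\)(?: or options)? \[([^\]]*)\]")


def failed_plugins(line):
    """Return the plugin/option names listed in one line of indexer output.

    Returns an empty list when the line carries no bracketed plugin list.
    """
    match = PLUGIN_LIST.search(line)
    if match is None:
        return []
    # Names are single-quoted inside the brackets; collect each one.
    return re.findall(r"'([^']+)'", match.group(1))
```

Running this over each line of saved CLI output gives a quick tally of which plugin names recur across files, which helps distinguish a text-extraction failure from an OCR option that simply has not run yet.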

As I mentioned, all documents are in Microsoft Word format, so I'm not sure what OCR has to do with it. I've seen references to the first error message but couldn't find a solution.
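Since the failing plugin name refers to a Tika server, one thing that may be worth ruling out is whether that server is reachable from inside the VM at all. The sketch below is only a hedged starting point: it assumes an Apache Tika Server listening at `http://localhost:9998` (Tika's default port; the address this installation actually uses may differ and should be taken from its own configuration), and `tika_is_reachable` is a hypothetical helper, not part of Open Semantic Search.

```python
import urllib.request


def tika_is_reachable(base_url="http://localhost:9998", timeout=5):
    """Return True if a GET on <base_url>/tika answers with HTTP 200.

    Assumption: base_url points at an Apache Tika Server, whose /tika
    resource answers a plain GET with a short greeting. Any connection
    error, HTTP error, or timeout is reported as "not reachable".
    """
    url = base_url.rstrip("/") + "/tika"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # urllib.error.URLError, HTTPError and socket timeouts all
        # derive from OSError, so one clause covers them.
        return False
```

Calling `tika_is_reachable()` from a Python shell inside the VM returns `False` if nothing answers on that address, which would point at the extraction service rather than at the document format. A `True` result only shows the server is up, not that extraction itself succeeds.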

Thanks.

denispol commented 1 year ago

I confirm that this happens as well with .pdf and other Office formats (.xls, .xlsx), using the latest from master.

AndreaPux commented 1 year ago

Same problem here. Honestly, Open Semantic Search seems like a wonderful tool, but it's quite a frustrating experience. I spent one week trying to install OSS on Ubuntu LTS, and the only solution was to use Debian instead, in spite of what the docs claim. Now, on Debian, the tool is installed but it doesn't index file contents, and from what I gather here, the problem has been known since 2021 and no solution has been proposed.