opensemanticsearch / open-semantic-search

Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)
https://opensemanticsearch.org
GNU General Public License v3.0
976 stars 169 forks

Ubuntu 18.04: Open Semantic Search Server 20.04.17 unable to index content of pdf file #310

Open RiteshSingh opened 4 years ago

RiteshSingh commented 4 years ago

First of all, thank you so much for this most valuable application.

I am unable to index the content of PDF files on Ubuntu 18.04.

This is the error while executing the index command opensemanticsearch-index-dir ./gandhi:

Repeating indexing of unchanged file because critical plugin(s) ['enhance_extract_text_tika_server'] failed in former run: /home/bodhi/Downloads/gandhi/mahatma-gandhi-collected-works-volume-98.pdf

The web interface gives these 3 errors (screenshot attached):

 Failed tasks while import & analysis (ETL)

    enhance_extract_text_tika_server (2) -
    enhance_file_mtime (2) -
    filter_file_not_modified (2) -

[Screenshot: shot-2020-10-21_12-31-02]

What I have tried many times:

  1. Restarting the machine
  2. Enabling enhance-pdf-ocr in connector-files:
    #Enable OCR for images inside PDF files
    config['plugins'].append('enhance_pdf_ocr')
  3. Re-indexing

Please guide.

RiteshSingh commented 4 years ago

The same error also occurs in Open Semantic Search Server 20.01.17.

Mandalka commented 4 years ago

It seems the user opensemanticetl (the service that indexes the files in parallel from the message queue) has no rights to access the files. There may be more error messages in the "Import & analysis process (ETL)" tab of the file preview, or you can check as root by running:

service opensemanticetl status
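
To narrow down a possible permission problem, here is a minimal Python sketch that walks from the file up to the filesystem root and reports anything a non-owner could not reach. It assumes the opensemanticetl user is neither the owner nor a group member of the files, so only the "other" permission bits apply; the function name world_access_problems is made up for illustration and is not part of Open Semantic Search.

```python
import os
import stat


def world_access_problems(path):
    """Return (path, reason) pairs for components that block 'other' users.

    Assumption: the ETL user is neither the owner nor in the group of these
    files, so only the 'other' permission bits apply to it.
    """
    problems = []
    current = os.path.abspath(path)
    mode = os.stat(current).st_mode
    # The target itself must be readable (and traversable if it is a directory).
    if not mode & stat.S_IROTH:
        problems.append((current, "not readable by others"))
    if stat.S_ISDIR(mode) and not mode & stat.S_IXOTH:
        problems.append((current, "not traversable by others"))
    # Every parent directory needs the execute (search) bit to be passed through.
    parent = os.path.dirname(current)
    while True:
        if not os.stat(parent).st_mode & stat.S_IXOTH:
            problems.append((parent, "not traversable by others"))
        if parent == os.path.dirname(parent):
            break  # reached the filesystem root
        parent = os.path.dirname(parent)
    return problems
```

Calling world_access_problems("/home/bodhi/Downloads/gandhi/mahatma-gandhi-collected-works-volume-98.pdf") on the affected machine would list, for example, a home directory that is closed to other users. An empty list does not prove access works (ACLs or mount options are not checked), so treat it only as a first hint.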

RiteshSingh commented 4 years ago

service opensemanticetl status gives the following error:

Nov 08 16:17:58 lubuntu etl_tasks[17389]: I/O Error: Couldn't open file './gandhi/mahatma-gandhi-collected-works-volume-98.pdf': No such file or directory.

mosea3 commented 3 years ago

Try giving the index commands absolute paths rather than relative paths. So for you, I guess:

opensemanticsearch-index-file /home/bodhi/Downloads/gandhi/mahatma-gandhi-collected-works-volume-98.pdf
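
The service log above shows the worker trying to open './gandhi/...' and failing, which suggests the relative path was resolved against the worker's working directory instead of the shell's. A minimal Python sketch of making the path absolute before handing it to the indexer follows. The helper name build_index_command is hypothetical, and picking the tool by path type is an assumption; only the two command names come from this thread.

```python
import os


def build_index_command(path):
    """Build an indexing command line with the path made absolute.

    The relative path is resolved against the caller's current working
    directory, so the ETL worker later receives a location that does not
    depend on where it runs.
    """
    absolute = os.path.abspath(path)
    # Both tool names appear in this thread; choosing by path type is an assumption.
    tool = ("opensemanticsearch-index-dir" if os.path.isdir(absolute)
            else "opensemanticsearch-index-file")
    return [tool, absolute]
```

The returned list can be passed to subprocess.run(cmd, check=True). On the command line, quoting the output of realpath for the directory argument should have the same effect.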