opensemanticsearch / open-semantic-search

Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)
https://opensemanticsearch.org
GNU General Public License v3.0
962 stars 169 forks

Failed tasks while import & analysis (ETL) enhance_extract_text_tika_server #289

Open DennisNDean opened 4 years ago

DennisNDean commented 4 years ago

When I run opensemanticsearch-index-dir on a directory of PDFs for text extraction, every file is marked as:

Failed tasks while import & analysis (ETL)
enhance_extract_text_tika_server (777) -
enhance_file_mtime (777) -
filter_file_not_modified (777) -

If I re-run the index, each file reports Repeating indexing of unchanged file because critical plugin(s) ['enhance_extract_text_tika_server'] failed in former run: followed by the path.

If I run it in verbose mode, the output for each file looks like this:

Repeating indexing of unchanged file because critical plugin(s) ['enhance_extract_text_tika_server'] failed in former run: /mnt/s3-bucket/LBPDPublicDocs/PUBLIC PRA Documents/Use of Force/ICD 08:29:17-3420 Pacific Pl/Photos/Photos_Part116.pdf
Starting plugin enhance_file_mtime
File modification time: 2020-06-12T00:22:07Z
Starting plugin enhance_path
Starting plugin enhance_entity_linking
Entity linking / Solr Text Tagger result for tagger all_labels_ss_tag: {
  "responseHeader":{
    "status":0,
    "QTime":0},
  "tagsCount":0,
  "tags":[],
  "response":{"numFound":0,"start":0,"docs":[]
  }}

Named Entity Linking by Tagger all_labels_ss_tag: {}
Starting plugin enhance_multilingual
Multilinguality: Add filename_extension_s to _text_
Multilinguality: Add filename_extension_s to text_txt_en
Multilinguality: Add path0_s to _text_
Multilinguality: Add path0_s to text_txt_en
Multilinguality: Add path1_s to _text_
Multilinguality: Add path1_s to text_txt_en
Multilinguality: Add path2_s to _text_
Multilinguality: Add path2_s to text_txt_en
Multilinguality: Add path3_s to _text_
Multilinguality: Add path3_s to text_txt_en
Multilinguality: Add path4_s to _text_
Multilinguality: Add path4_s to text_txt_en
Multilinguality: Add path_basename_s to _text_
Multilinguality: Add path_basename_s to text_txt_en
Starting plugin export_solr
Starting Exporter: Solr
Sending update request to http://localhost:8983/solr/opensemanticsearch/update
[{"etl_file_b": {"set": true}, "etl_error_plugins_ss": {"set": []}, "etl_error_txt": {"set": []}, "etl_error_enhance_mapping_id_txt": {"set": []}, "etl_enhance_mapping_id_time_millis_i": {"set": 0}, "etl_enhance_mapping_id_b": {"set": true}, "etl_error_filter_blacklist_txt": {"set": []}, "etl_filter_blacklist_time_millis_i": {"set": 0}, "etl_filter_blacklist_b": {"set": true}, "etl_error_filter_file_not_modified_txt": {"set": []}, "etl_enhance_file_mtime_b": {"set": true}, "etl_enhance_path_b": {"set": true}, "etl_enhance_entity_linking_b": {"set": true}, "etl_enhance_multilingual_b": {"set": true}, "etl_enhance_pdf_ocr_b": {"set": false}, "etl_enhance_extract_text_tika_server_ocr_enabled_b": {"set": false}, "etl_filter_file_not_modified_time_millis_i": {"set": 5}, "etl_filter_file_not_modified_b": {"set": true}, "etl_error_enhance_file_mtime_txt": {"set": []}, "file_modified_dt": {"set": "2020-06-12T00:22:07Z"}, "etl_enhance_file_mtime_time_millis_i": {"set": 0}, "etl_error_enhance_path_txt": {"set": []}, "filename_extension_s": {"set": "pdf"}, "path0_s": {"set": "LBPDPublicDocs"}, "path1_s": {"set": "PUBLIC PRA Documents"}, "path2_s": {"set": "Use of Force"}, "path3_s": {"set": "ICD 08:29:17-3420 Pacific Pl"}, "path4_s": {"set": "Photos"}, "path_basename_s": {"set": "Photos_Part116.pdf"}, "etl_enhance_path_time_millis_i": {"set": 0}, "etl_error_enhance_entity_linking_txt": {"set": []}, "etl_enhance_entity_linking_time_millis_i": {"set": 3}, "etl_error_enhance_multilingual_txt": {"set": []}, "_text_": {"set": ["pdf", "LBPDPublicDocs", "PUBLIC PRA Documents", "Use of Force", "ICD 08:29:17-3420 Pacific Pl", "Photos", "Photos_Part116.pdf"]}, "text_txt_en": {"set": ["pdf", "LBPDPublicDocs", "PUBLIC PRA Documents", "Use of Force", "ICD 08:29:17-3420 Pacific Pl", "Photos", "Photos_Part116.pdf"]}, "etl_enhance_multilingual_time_millis_i": {"set": 0}, "etl_error_export_solr_txt": {"set": []}, "id": "LBPDPublicDocs/PUBLIC PRA Documents/Use of Force/ICD 08:29:17-3420 Pacific Pl/Photos/Photos_Part116.pdf"}]
Starting plugin export_queue_files
Starting Exporter: Solr
Not exported to Solr because no data or yet exported in this ETL run, because exporter was runned as plugin.
Scanning file: Photos_Part117.pdf
Starting plugin enhance_mapping_id
Starting plugin filter_blacklist
Starting plugin filter_file_not_modified

Any ideas? Thanks!
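
For anyone trying to read the long update request above, here is a minimal Python sketch that lists the boolean etl_*_b status flags from such a payload. It is only a reading aid, not part of Open Semantic Search; the function name summarize_etl_flags and the shortened "id" value are made up for illustration, and the payload is an abbreviated excerpt of the one in the log. In the log above, the payload carries no etl_enhance_extract_text_tika_server_b field and the verbose output has no "Starting plugin enhance_extract_text_tika_server" line, which a summary like this makes easier to spot.

```python
import json


def summarize_etl_flags(update_payload: str) -> dict:
    """Map each etl_*_b field in a Solr atomic-update payload to its "set" value."""
    flags = {}
    for doc in json.loads(update_payload):
        for field, value in doc.items():
            is_flag = field.startswith("etl_") and field.endswith("_b")
            if is_flag and isinstance(value, dict):
                flags[field] = value.get("set")
    return flags


# Abbreviated excerpt of the payload from the verbose log.
# The "id" value is a hypothetical placeholder, not the real document id.
payload = (
    '[{"etl_file_b": {"set": true}, '
    '"etl_error_plugins_ss": {"set": []}, '
    '"etl_enhance_file_mtime_b": {"set": true}, '
    '"etl_enhance_pdf_ocr_b": {"set": false}, '
    '"etl_enhance_extract_text_tika_server_ocr_enabled_b": {"set": false}, '
    '"id": "example.pdf"}]'
)

flags = summarize_etl_flags(payload)
for name in sorted(flags):
    print(f"{name}: {flags[name]}")
```

Pasting the full payload from a verbose run in place of the excerpt should work the same way, as long as it is valid JSON (the document id must be on a single line).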

robinjos333 commented 4 years ago

I am facing the same issue

Failed tasks while import & analysis (ETL)

RiteshSingh commented 3 years ago

I am facing the same issue on Ubuntu 18 and Debian.
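
Since the failing plugin is enhance_extract_text_tika_server, one thing worth ruling out is whether a Tika server is reachable at all from the indexing host. The sketch below is only a guess at a first diagnostic step, not a confirmed cause or fix. It assumes a Tika server listening on http://localhost:9998 (Tika's default port; your installation may be configured differently) and uses Tika server's GET /version endpoint. The helper names version_url and check_tika are made up for this example.

```python
import http.client
import urllib.request


def version_url(base_url: str) -> str:
    """Build the URL of the Tika server /version endpoint."""
    return base_url.rstrip("/") + "/version"


def check_tika(base_url: str, timeout: float = 5.0):
    """Return (ok, detail): ok is True if the server answered /version."""
    try:
        with urllib.request.urlopen(version_url(base_url), timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace").strip()
            return True, body
    except (OSError, http.client.HTTPException) as exc:
        # Connection refused, timeouts, and HTTP errors all end up here.
        return False, str(exc)


if __name__ == "__main__":
    # ASSUMPTION: default Tika server address; adjust to your setup.
    ok, detail = check_tika("http://localhost:9998")
    print("Tika reachable:" if ok else "Tika NOT reachable:", detail)
```

If the server does not answer, the text extraction plugin cannot succeed regardless of the PDFs themselves; if it does answer, the problem is more likely elsewhere (for example in how the files are read from the mounted path).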