opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0
254 stars, 69 forks

enhance_extract_text_tika_server.py fails unless headers=headers commented out #150

Closed: jgillum closed this issue 2 years ago

jgillum commented 2 years ago

Extracting text using Tika fails (error 400) unless line 142 is commented out in enhance_extract_text_tika_server.py:

parsed = parser.from_file(
    filename=filename,
    serverEndpoint=tika_server,
    # headers=headers,
    requestOptions={'timeout': 60000})

I don't understand why that fixes it; perhaps it has something to do with the latest Python packages and Tika (version 2.1.0)? I'm running Ubuntu 20.

Mandalka commented 2 years ago

This is an issue with Tika 2.x, which no longer accepts the X-Tika-OCRTesseractPath header: https://github.com/opensemanticsearch/open-semantic-search/issues/389
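
As a stopgap that avoids dropping all headers, something along these lines might work: filter out only the header that Tika 2.x rejects and pass the rest through. This is a minimal sketch, not the project's actual fix. The helper name drop_unsupported_headers, the UNSUPPORTED_IN_TIKA_2 set, and the example header values are assumptions made for illustration.

```python
# Hypothetical helper, not part of open-semantic-etl.
# Header names are compared case-insensitively, as HTTP header names are.
UNSUPPORTED_IN_TIKA_2 = {"x-tika-ocrtesseractpath"}


def drop_unsupported_headers(headers):
    """Return a copy of headers without the ones listed as unsupported."""
    return {
        name: value
        for name, value in (headers or {}).items()
        if name.lower() not in UNSUPPORTED_IN_TIKA_2
    }


# Illustrative values only; which other headers a given Tika version
# accepts is not something this sketch checks.
example_headers = {
    "X-Tika-OCRTesseractPath": "/usr/bin",
    "X-Tika-OCRLanguage": "eng",
}
cleaned = drop_unsupported_headers(example_headers)

# The filtered dict could then be passed to the call from the report:
# parsed = parser.from_file(
#     filename=filename,
#     serverEndpoint=tika_server,
#     headers=cleaned,
#     requestOptions={'timeout': 60000})
```

The original dict is left untouched, so the same configuration could still be used against an older Tika server.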

opensemanticsearch commented 2 years ago

Fixed by migrating to Tika 2.x in #142 (a new release will follow in the next few days).