opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0

Upgrade to Tika 2.x #142

Closed: opensemanticsearch closed this issue 2 years ago

opensemanticsearch commented 2 years ago

https://dist.apache.org/repos/dist/release/tika/2.1.0/CHANGES-2.1.0.txt
https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0

Mandalka commented 2 years ago

We have to check, and possibly change, some things in ETL because of the following BREAKING CHANGES in Tika 2.0.0:

Mandalka commented 2 years ago

From the Tika migration guide (https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0): "Removed duplicate/triplicate keys

Background: In early 1.x, we had basic metadata keys that were created somewhat ad hoc. We then added metadata keys based on standards such as Dublin Core, or we at least tried to add namespaces to the metadata keys for specific file formats. To maintain backwards compatibility, we kept the old keys and added new keys. This led to quite a bit of metadata bloat, where we'd have the same information two or three times. In Tika 2.x, we slimmed down the metadata keys and relied only on the standards-based or name-spaced keys. In the table below, we document the mappings."

Mandalka commented 2 years ago

The following field name has to be renamed in the Open Semantic ETL Tika plugin:

X-Parsed-By to X-TIKA:Parsed-By
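One way to handle the rename while both Tika versions may still be in use is to look for the new key first and fall back to the old one. This is only a minimal sketch: the function name `get_parsed_by` and the assumption that metadata arrives as a plain dict are illustrative, not taken from the ETL Tika plugin.

```python
def get_parsed_by(metadata):
    """Return the list of parser names from a Tika metadata dict.

    Hypothetical helper: checks the Tika 2.x key first, then the
    Tika 1.x key, so it works with metadata from either version.
    """
    for key in ("X-TIKA:Parsed-By", "X-Parsed-By"):
        value = metadata.get(key)
        if value is not None:
            # Tika may return a single string or a list of strings.
            return value if isinstance(value, list) else [value]
    return []
```

Returning a list in every case keeps callers from having to distinguish between a single parser and a parser chain.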

It seems the following header (which we use to enable our Tesseract OCR cache) no longer works in Tika 2.x:

X-Tika-OCRTesseractPath

Related documentation: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr

Related issue: https://github.com/opensemanticsearch/open-semantic-search/issues/389

Mandalka commented 2 years ago

A mapping was missing from https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0 (now fixed):

Tika 1.x: title Tika 2.x: dc:title
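The renames mentioned in this thread could be collected in a small lookup table and applied to incoming metadata. The sketch below is an assumption about how that might look, not the code merged for this issue: the names `TIKA1_TO_TIKA2_KEYS` and `normalize_metadata_keys` are hypothetical, and the table holds only the two mappings named here, not the full list from the migration guide.

```python
# Hypothetical rename table with only the mappings mentioned in this issue.
TIKA1_TO_TIKA2_KEYS = {
    "X-Parsed-By": "X-TIKA:Parsed-By",
    "title": "dc:title",
}


def normalize_metadata_keys(metadata):
    """Return a copy of metadata with Tika 1.x keys renamed to 2.x keys."""
    normalized = {}
    for key, value in metadata.items():
        new_key = TIKA1_TO_TIKA2_KEYS.get(key, key)
        # If both spellings are present, keep the value stored under the 2.x key.
        if new_key != key and new_key in metadata:
            continue
        normalized[new_key] = value
    return normalized
```

Keys that are not in the table pass through unchanged, so the function can be applied to metadata from Tika 2.x without side effects.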

Mandalka commented 2 years ago

Merged https://github.com/opensemanticsearch/open-semantic-etl/pull/152