Upgrade to Tika 2.x

opensemanticsearch commented 2 years ago

https://dist.apache.org/repos/dist/release/tika/2.1.0/CHANGES-2.1.0.txt
https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0

Mandalka commented 2 years ago

We have to check, and possibly change, some things in the ETL because of the following BREAKING CHANGES in 2.0.0:

"OCR is now triggered automatically for PDFs if tesseract is on the user's path see (https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr) for how to disable OCR." We must check whether our way of disabling/enabling OCR works with Tika 2.x as well.
"Parsers can be configured via tika-config.xml on instantiation. We have moved away from configuration via .properties files because of confusion among users. This affects the PDFParser, TesseractOCRParser and the StringsParser." Move our custom changes in these properties files, such as the OCR timeout, to REST API parameters if not yet done.
"tika-server now by default forks a process to isolate the parsing in the forked process (this was called the -spawnChild option in tika-1.x). Clients must now expect that tika-server will restart on OOM, timeouts, crashes or after parsing a large number of files. When this happens tika-server will restart and not receive connections for brief periods." This should work with the already implemented "retry" strategy of the Open Semantic ETL Tika plugin.

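For illustration, a minimal sketch of a retry strategy that tolerates the brief periods in which tika-server restarts and refuses connections. This is not the actual code of the Open Semantic ETL Tika plugin: `call_with_retry`, `TikaUnavailable` and all default values are hypothetical names and numbers chosen for this example.

```python
import time


class TikaUnavailable(Exception):
    """Raised when the request still fails after the last retry (hypothetical name)."""


def call_with_retry(request, retries=5, delay=1.0, backoff=2.0,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call request() and retry with a growing pause while the server restarts.

    request  -- zero-argument callable that performs the HTTP call
    retries  -- maximum number of attempts (assumed default)
    delay    -- pause in seconds before the second attempt (assumed default)
    backoff  -- factor by which the pause grows after each failure
    retry_on -- exception types treated as "server temporarily not reachable"
    sleep    -- injectable for testing, defaults to time.sleep
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    wait = delay
    for attempt in range(1, retries + 1):
        try:
            return request()
        except retry_on as exc:
            if attempt == retries:
                # Out of attempts: surface the failure to the caller
                raise TikaUnavailable(f"gave up after {retries} attempts") from exc
            sleep(wait)
            wait *= backoff
```

If the HTTP call is made with the `requests` library, `retry_on` would have to include `requests.exceptions.ConnectionError`, because that class does not inherit from the built-in `ConnectionError`.
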
Mandalka commented 2 years ago

From https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0 "Removed duplicate/triplicate keys

Background: In early 1.x, we had basic metadata keys that were created somewhat ad hoc. We then added metadata keys based on standards such as Dublin Core, or we at least tried to add namespaces to the metadata keys for specific file formats. To maintain backwards compatibility, we kept the old keys and added new keys. This led to quite a bit of metadata bloat, where we'd have the same information two or three times. In Tika 2.x, we slimmed down the metadata keys and relied only on the standards-based or name-spaced keys. In the table below, we document the mappings."

Mandalka commented 2 years ago

The following field name has to be renamed in the Open Semantic ETL Tika plugin:

X-Parsed-By to X-TIKA:Parsed-By

It seems the following header (which we use for our Tesseract OCR cache) doesn't work anymore in Tika 2.x:

X-Tika-OCRTesseractPath

Mandalka commented 2 years ago

Missing mapping in https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0 (fixed)

Tika 1.x: title, Tika 2.x: dc:title
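
The two renames mentioned in this thread could be applied to a metadata dictionary roughly as follows. This is only a sketch: `RENAMED_KEYS` and `to_tika2_keys` are hypothetical names, the table contains just the two mappings named above (the full list is in the migration guide), and the merged pull request may handle this differently.

```python
# Tika 1.x key -> Tika 2.x key; only the mappings mentioned in this issue
RENAMED_KEYS = {
    "X-Parsed-By": "X-TIKA:Parsed-By",
    "title": "dc:title",
}


def to_tika2_keys(metadata, renamed=RENAMED_KEYS):
    """Return a copy of metadata with Tika 1.x keys renamed to their 2.x names.

    A value already stored under the 2.x key is kept and not overwritten
    by the value of the old key.
    """
    # Keep every key that is not an old 1.x name
    result = {key: value for key, value in metadata.items() if key not in renamed}
    # Move values of old keys to the new names unless the new key is present
    for old_key, new_key in renamed.items():
        if old_key in metadata and new_key not in result:
            result[new_key] = metadata[old_key]
    return result
```

Keeping the mapping in one table means that further renamed keys from the migration guide can be added without touching the function.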

Mandalka commented 2 years ago

Merged https://github.com/opensemanticsearch/open-semantic-etl/pull/152

opensemanticsearch / open-semantic-etl

Upgrade to Tika 2.x #142