OCR of embedded images by Tika-Server

opensemanticsearch commented 4 years ago

Since Tika-Server can now use a cache and a user dictionary through our Tesseract wrapper (Tesseract OCR Cache), do OCR of images embedded in PDFs with Tika-Server instead of our own plugin enhance_pdf_ocr, so that text from images appears inline within the content text instead of in a separate tab.

But still run the plugin enhance_pdf_ocr as fallback/additional OCR if Tika OCR throws an exception such as https://issues.apache.org/jira/projects/TIKA/issues/TIKA-3040, since there are documents where Tika fails but extraction of the images by libpoppler followed by Tesseract works.

Mandalka commented 4 years ago

Enabled PDF inline OCR by Tika-Server and increased the timeouts: via "requestOptions" "timeout" for tika-python and via the header X-Tika-OCRTimeout for Tika-Server.
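For illustration, a minimal sketch of how such a request could look with tika-python. This is not the code committed to open-semantic-etl; the function names, the default timeout values and the server URL are assumptions, and the header X-Tika-PDFextractInlineImages is assumed to be the switch that turns on inline image OCR in the Tika-Server version in use.

```python
def build_tika_ocr_options(ocr_timeout_seconds=600, extract_inline_images=True):
    """Build headers and request options for a Tika-Server OCR request.

    Hypothetical helper; the default of 600 seconds is an assumption.
    """
    # Header read by Tika-Server to limit the run time of Tesseract
    headers = {"X-Tika-OCRTimeout": str(ocr_timeout_seconds)}

    if extract_inline_images:
        # Assumed header name for enabling OCR of images embedded in PDFs
        headers["X-Tika-PDFextractInlineImages"] = "true"

    # HTTP client timeout for tika-python; kept slightly above the OCR
    # timeout so the client does not give up before the server does
    request_options = {"timeout": ocr_timeout_seconds + 60}

    return headers, request_options


def parse_with_tika(filename, server_endpoint="http://localhost:9998"):
    """Parse a file with Tika-Server (server URL is an assumption)."""
    # Imported here so the helper above works without tika-python installed
    from tika import parser

    headers, request_options = build_tika_ocr_options()
    return parser.from_file(
        filename,
        serverEndpoint=server_endpoint,
        headers=headers,
        requestOptions=request_options,
    )
```

Both timeouts have to be raised together: if only the server-side OCR timeout is increased, the HTTP client can still abort the request while Tesseract is running.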

Mandalka commented 4 years ago

The legacy PDF OCR plugin enhance_pdf_ocr now runs as a fallback if OCR of the PDF by Tika fails.
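The fallback logic can be sketched roughly as follows. This is a simplified illustration, not the actual plugin code: the function name and the two callables passed in are hypothetical stand-ins for the Tika request and for the enhance_pdf_ocr plugin.

```python
import sys


def ocr_pdf_with_fallback(filename, tika_ocr, legacy_ocr):
    """Try OCR by Tika first; use the legacy plugin only if Tika fails.

    tika_ocr and legacy_ocr are hypothetical callables that take a
    filename and return the extracted text.
    Returns a tuple (text, name of the method that produced it).
    """
    try:
        return tika_ocr(filename), "tika"
    except Exception as error:
        # Tika OCR failed (for example an exception like TIKA-3040),
        # so fall back to image extraction and OCR by the legacy plugin
        sys.stderr.write(
            "Tika OCR failed for {}: {}. Falling back to legacy OCR.\n".format(
                filename, error
            )
        )
        return legacy_ocr(filename), "enhance_pdf_ocr"
```

Returning the name of the method alongside the text makes it possible to record in the index which OCR path produced the content.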

opensemanticsearch / open-semantic-etl

OCR of embedded images by Tika-Server #121