Closed by opensemanticsearch 4 years ago
Enabled inline OCR of PDFs by Tika-Server and increased the timeouts: via "requestOptions" "timeout" for tika-python and via the header X-Tika-OCRTimeout for Tika-Server.
The legacy PDF OCR plugin enhance_pdf_ocr now serves as a fallback if OCR of a PDF by Tika fails.
Since Tika-Server can now use a cache and a user dictionary through our Tesseract wrapper, Tesseract OCR Cache, OCR of images embedded in PDFs is done by Tika-Server instead of by our own plugin enhance_pdf_ocr. Text from images therefore appears inline within the content text instead of in a separate tab.
However, fallback/additional OCR is still run by the plugin enhance_pdf_ocr if Tika OCR throws an exception such as https://issues.apache.org/jira/projects/TIKA/issues/TIKA-3040, since for some of these documents image extraction by libpoppler and OCR by tesseract still work.
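
The change described above could look roughly like the following sketch. It is not the project's actual code: the function names, the 600-second timeout, and the use of the X-Tika-PDFextractInlineImages header to switch on inline image extraction are assumptions for illustration. Only the "requestOptions" "timeout" setting and the X-Tika-OCRTimeout header are taken from the description above, and the fallback callable merely stands in for the enhance_pdf_ocr plugin.

```python
from typing import Callable, Optional

# Assumed value for illustration; the real timeout is a configuration choice.
OCR_TIMEOUT_SECONDS = 600


def build_tika_request_kwargs(timeout_seconds: int = OCR_TIMEOUT_SECONDS) -> dict:
    """Build keyword arguments for tika-python's parser.from_file()."""
    return {
        "headers": {
            # Assumption: ask Tika-Server to extract (and OCR) inline PDF images.
            "X-Tika-PDFextractInlineImages": "true",
            # Server-side OCR timeout, as named in the description above.
            "X-Tika-OCRTimeout": str(timeout_seconds),
        },
        # Client-side HTTP timeout passed through to the underlying request.
        "requestOptions": {"timeout": timeout_seconds},
    }


def parse_with_tika(path: str) -> str:
    """Extract text, including inline OCR, through Tika-Server."""
    from tika import parser  # imported lazily so the sketch loads without tika

    parsed = parser.from_file(path, **build_tika_request_kwargs())
    return parsed.get("content") or ""


def extract_text(
    path: str,
    primary: Callable[[str], str] = parse_with_tika,
    fallback: Optional[Callable[[str], str]] = None,
) -> str:
    """Try the primary extractor; on any exception, use the fallback."""
    try:
        return primary(path)
    except Exception:
        if fallback is None:
            raise
        # Hypothetical stand-in for the legacy enhance_pdf_ocr plugin.
        return fallback(path)
```

A caller would pass its own legacy OCR routine as `fallback`, for example `extract_text("scan.pdf", fallback=legacy_pdf_ocr)`, where `legacy_pdf_ocr` is a hypothetical wrapper around the plugin. Passing the extractors as callables keeps the fallback logic testable without a running Tika-Server.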