opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0

OCR of embedded images by Tika-Server #121

Closed (opensemanticsearch closed this issue 4 years ago)

opensemanticsearch commented 4 years ago

Since Tika-Server can now use a cache and a user dictionary through our Tesseract wrapper, Tesseract OCR Cache, do OCR of images embedded in PDFs with Tika-Server instead of our own plugin enhance_pdf_ocr, so that text from images appears inline within the content text instead of in a separate tab.

But still run the plugin enhance_pdf_ocr as fallback/additional OCR if Tika OCR throws an exception like https://issues.apache.org/jira/projects/TIKA/issues/TIKA-3040, since for some of these documents extraction of the images by libpoppler and OCR by tesseract still works.

Mandalka commented 4 years ago

Enabled inline PDF OCR by Tika-Server and increased the timeouts: for tika-python via the "timeout" entry of "requestOptions", and for Tika-Server via the header X-Tika-OCRTimeout.
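For illustration, the two timeout settings mentioned above might be passed to tika-python roughly as follows. This is a minimal sketch, not the project's actual code: the helper names `build_tika_ocr_options` and `parse_pdf`, the timeout values, and the server URL are hypothetical, and the use of the `X-Tika-PDFextractInlineImages` header to switch on inline image extraction is an assumption about how inline OCR gets enabled.

```python
def build_tika_ocr_options(ocr_timeout_seconds=3600, request_timeout_seconds=3600):
    """Build HTTP headers for Tika-Server and request options for tika-python.

    Hypothetical helper; the default values are illustrative only.
    """
    headers = {
        # Assumption: inline images in PDFs are handed to OCR via this header
        "X-Tika-PDFextractInlineImages": "true",
        # OCR timeout header named in this issue (value in seconds)
        "X-Tika-OCRTimeout": str(ocr_timeout_seconds),
    }
    # Passed through by tika-python to its HTTP client
    request_options = {"timeout": request_timeout_seconds}
    return headers, request_options


def parse_pdf(path, server_endpoint="http://localhost:9998"):
    """Send a file to a running Tika-Server (endpoint is an assumption)."""
    from tika import parser  # imported lazily so the helper above works without tika

    headers, request_options = build_tika_ocr_options()
    return parser.from_file(
        path,
        serverEndpoint=server_endpoint,
        headers=headers,
        requestOptions=request_options,
    )
```

Both timeouts need to be raised together: a long OCR timeout on the server side does not help if the HTTP client in tika-python gives up earlier.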

Mandalka commented 4 years ago

The legacy PDF OCR plugin enhance_pdf_ocr now runs as a fallback if OCR of a PDF by Tika fails.
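The fallback behaviour can be sketched in a few lines. This is only an illustration of the pattern under stated assumptions, not the actual plugin wiring: `extract_text_with_fallback` is a hypothetical name, and `primary_ocr` and `fallback_ocr` stand in for the Tika-Server call and the enhance_pdf_ocr plugin.

```python
def extract_text_with_fallback(path, primary_ocr, fallback_ocr):
    """Try the primary extractor; if it raises, use the fallback instead.

    Returns a tuple (text, source), where source is "primary" or "fallback".
    Both callables are hypothetical stand-ins that take a file path.
    """
    try:
        return primary_ocr(path), "primary"
    except Exception:
        # Primary OCR failed (for example with a parser exception),
        # so run the legacy extraction on the same file.
        return fallback_ocr(path), "fallback"


def failing_ocr(path):
    # Stand-in for a Tika call that throws on a problematic PDF
    raise RuntimeError("simulated OCR failure")


def legacy_ocr(path):
    # Stand-in for the legacy plugin
    return "text from legacy OCR"


text, source = extract_text_with_fallback("example.pdf", failing_ocr, legacy_ocr)
print(source)
```

Returning the source alongside the text makes it possible to record which path produced the result.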