PDF OCR plugin: Use the same Tesseract options as tika-server

opensemanticsearch / open-semantic-etl

Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database

https://opensemanticsearch.org/etl

GNU General Public License v3.0


PDF OCR plugin: Use the same Tesseract options as tika-server #119

Closed: opensemanticsearch closed this issue 3 years ago

opensemanticsearch commented 4 years ago

Use the same Tesseract options as tika-server in the enhance_pdf_ocr plugin, so that both can share the same OCR cache results.

Mandalka commented 3 years ago

I am closing this issue for two reasons. First, Tika seems to convert/optimize the image(s) before OCR, so the input, and therefore its hash, is not the same for the same embedded image files. Second, the plugin is no longer enabled by default, since our new default settings now use Tika's OCR with our new tesseract-ocr-cache.
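
To illustrate why the cache cannot be shared, here is a minimal Python sketch. It assumes a cache key derived from a hash of the input image bytes plus the OCR options. The function `ocr_cache_key`, the option names and the sample byte strings are hypothetical and chosen only for illustration; this is not the actual key scheme of tesseract-ocr-cache or tika-server.

```python
import hashlib


def ocr_cache_key(image_bytes: bytes, options: dict) -> str:
    """Hypothetical cache key: SHA-256 over the image bytes and the sorted OCR options."""
    digest = hashlib.sha256()
    digest.update(image_bytes)
    # Sort the options so the key does not depend on dict insertion order.
    for name in sorted(options):
        digest.update(f"{name}={options[name]};".encode("utf-8"))
    return digest.hexdigest()


# Stand-ins for real image data (assumption: any change in bytes models a conversion step).
embedded_image = b"bytes of an image embedded in a PDF"
preprocessed_image = b"bytes of the same picture after conversion/optimization"

options = {"lang": "eng", "psm": "3"}  # illustrative option names only

key_plugin = ocr_cache_key(embedded_image, options)
key_same = ocr_cache_key(embedded_image, dict(options))
key_tika = ocr_cache_key(preprocessed_image, options)
key_other_options = ocr_cache_key(embedded_image, {"lang": "deu", "psm": "3"})

print(key_plugin == key_same)           # same bytes and options: cache hit
print(key_plugin == key_tika)           # different bytes: cache miss
print(key_plugin == key_other_options)  # different options: cache miss
```

Under these assumptions, matching the Tesseract options alone would not be enough: as soon as one side converts the image before OCR, the hashed input differs and the two tools never see the same cache entry.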