tw4l / bulk-reviewer

DEPRECATED. Replaced with Electron desktop application: https://github.com/bulk-reviewer/bulk-reviewer
GNU Affero General Public License v3.0

Problem: Tika jar not extracting text from scanned PDFs #59

Open tw4l opened 5 years ago

tw4l commented 5 years ago

See Tika configuration: https://tika.apache.org/1.19.1/gettingstarted.html

And relevant Jira issue: https://issues.apache.org/jira/browse/TIKA-1729
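
To reproduce, a minimal sketch, assuming the Tika app jar is called from Python through `java -jar`. The jar path and the helper names (`build_tika_command`, `has_no_text`, `extract_text`) are hypothetical and not taken from this repo; `--text` is the Tika app's plain-text output option.

```python
import subprocess
from pathlib import Path

# Hypothetical location of the Tika app jar; adjust to the local setup.
TIKA_JAR = Path("tika-app.jar")


def build_tika_command(pdf_path, jar_path=TIKA_JAR):
    """Return the argument list for a plain-text extraction with the Tika app."""
    # --text asks the Tika command-line app for plain text output.
    return ["java", "-jar", str(jar_path), "--text", str(pdf_path)]


def has_no_text(extracted):
    """True when the extraction contains nothing but whitespace."""
    return not extracted.strip()


def extract_text(pdf_path, jar_path=TIKA_JAR):
    """Run Tika on one file and return its standard output (requires Java)."""
    result = subprocess.run(
        build_tika_command(pdf_path, jar_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Running `has_no_text(extract_text("scan.pdf"))` on a scanned PDF (file name is a placeholder) would flag the files affected by this issue.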

tw4l commented 5 years ago

The issue seems to be with the libraries for reading JPEG2000 and TIFF images. See: https://pdfbox.apache.org/2.0/dependencies.html#optional-components

Look into licensing issues before making any changes: these libraries are no longer included with Tika by default because of their licensing.
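
If the licensing turns out to be acceptable, one way to try the optional image libraries without rebuilding Tika might be to put them on the classpath next to the Tika app jar. A rough sketch, assuming the jars have been downloaded separately; the jar file names and the helper `build_classpath_command` are hypothetical, while `org.apache.tika.cli.TikaCLI` is the Tika app's main class. Whether this actually restores text extraction for scanned PDFs is untested here.

```python
import os

# Main class of the Tika command-line app.
TIKA_MAIN_CLASS = "org.apache.tika.cli.TikaCLI"


def build_classpath_command(pdf_path, tika_jar, extra_jars):
    """Build a java command that loads extra image I/O jars alongside Tika."""
    # os.pathsep is ":" on Linux/macOS and ";" on Windows.
    classpath = os.pathsep.join([str(tika_jar), *map(str, extra_jars)])
    return ["java", "-cp", classpath, TIKA_MAIN_CLASS, "--text", str(pdf_path)]


# Example with placeholder jar names (assumed, not verified against a release).
command = build_classpath_command(
    "scan.pdf",
    "tika-app.jar",
    ["jai-imageio-core.jar", "jai-imageio-jpeg2000.jar"],
)
```

The resulting list can be passed to `subprocess.run` in place of the plain `java -jar` call.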

tw4l commented 5 years ago

More information on licensing: https://github.com/jai-imageio/jai-imageio-core

tw4l commented 5 years ago

Possible alternative: https://github.com/geosolutions-it/imageio-ext/

tw4l commented 5 years ago

Other alternative: replace Tika with Textract.