opensemanticsearch / open-semantic-search

Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)
https://opensemanticsearch.org
GNU General Public License v3.0

JPEG2000 in pdf #475

Open danteand opened 9 months ago

danteand commented 9 months ago

On a Debian install, I get a lot of errors when indexing PDFs, indicating that JPEG2000 images cannot be read:

java[1179714]: ERROR [qtp1335914322-72] 14:16:52,117 org.apache.pdfbox.contentstream.PDFStreamEngine Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed

I put the following libraries in /opt/solr-8.11.1/contrib/extraction/lib:

jai-imageio-core-1.4.0.jar
jai-imageio-jpeg2000-1.4.0.jar

but I still get the errors. Has anyone had success getting JPEG2000 images in PDFs read during extraction?
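
For anyone trying to narrow this down: the sketch below asks the standard javax.imageio registry whether a JPEG2000 reader is visible on a given classpath. The class name ImageReaderCheck and the helper hasReader are made up for illustration. It only reports on the classpath it is launched with, so it can show whether the two jars register a reader at all, but it cannot show whether the Java process that logs the error actually loads them.

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import java.util.Iterator;

// Hypothetical diagnostic helper; not part of Solr, Tika or PDFBox.
class ImageReaderCheck {

    // Returns true if at least one registered ImageIO reader claims the format name.
    static boolean hasReader(String formatName) {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    public static void main(String[] args) {
        // Re-scan the classpath for ImageIO plugins (service providers packaged in jars).
        ImageIO.scanForPlugins();
        // "png" ships with the JDK and serves as a sanity check; "JPEG2000" needs a plugin.
        for (String name : new String[] {"JPEG2000", "png"}) {
            System.out.println(name + " reader available: " + hasReader(name));
        }
    }
}
```

One way to try it, assuming the jar location from the post: compile with javac ImageReaderCheck.java, then run java -cp ".:/opt/solr-8.11.1/contrib/extraction/lib/*" ImageReaderCheck. If JPEG2000 is reported as available here but the indexing errors persist, the jars are probably not on the classpath of the process doing the extraction.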