vaites / php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
MIT License
116 stars 22 forks

disable ocr #7

Closed Alexmg86 closed 6 years ago

Alexmg86 commented 6 years ago

Hi! Thanks for this package. I don't understand how to disable OCR; I'm getting a lot of errors in the log file.

vaites commented 6 years ago

@Alexmg86, I don't understand the problem. Why do you want to disable OCR? Which Tika methods are you using?

About the errors, can you attach the log?

Alexmg86 commented 6 years ago

@vaites Hi! Thanks for your answer! I'm just using the getText() method, and I get warnings in the log. My project uses another OCR system, from Abbyy, and I think Tesseract is slower and heavier on the CPU :) You can see my log below.

нояб. 27, 2017 11:10:22 ДП org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
TIFFImageWriter not loaded. tiff files will not be processed
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.

нояб. 27, 2017 11:10:22 ДП org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.tika.parser.ParseContext (file:/C:/www/trunk/public_html/eliseen/public/vendor/tika/tika-app-1.16.jar) to method com.sun.org.apache.xerces.internal.util.SecurityManager.setEntityExpansionLimit(int)
WARNING: Please consider reporting this to the maintainers of org.apache.tika.parser.ParseContext
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
INFO As a convenience, TikaCLI has turned on extraction of inline images for the PDFParser (TIKA-2374). This is not the default option in Tika generally or in tika-server.
As a convenience, TikaCLI has turned on extraction of inline images for the PDFParser (TIKA-2374). This is not the default option in Tika generally or in tika-server.

vaites commented 6 years ago

Thanks @Alexmg86, but I don't think we can do anything on the client side. You need to configure Tika itself to disable OCR, using a tika.xml config file: https://wiki.apache.org/tika/TikaOCR
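
For reference, a minimal sketch of what such a config file might look like, following the parser-exclude approach described on the linked TikaOCR wiki page. It keeps Tika's default parser but excludes the Tesseract OCR parser. The file name `tika-config.xml` used here is an assumption; any name should work as long as Tika is pointed at it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a Tika config that disables OCR (file name is an assumption) -->
<properties>
  <parsers>
    <!-- Use the default set of parsers... -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- ...but leave out the Tesseract-based OCR parser -->
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

When running the Tika app jar directly, the file does not need to live in a special directory: its path is passed with the `--config=` option, placed before the other options, for example `java -jar tika-app-1.16.jar --config=C:\path\to\tika-config.xml --text document.pdf` (both paths are hypothetical).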

Alexmg86 commented 6 years ago

Many thanks. I had seen this solution, but I don't understand where this file needs to be saved on Windows :) That's why I asked you )) OK, I'll try again.