Selectively using tesseract with tika

grimlinda56 commented 4 years ago

I am trying to process a directory of diverse file types using Tika. Some are PDFs that require OCR. I ONLY want to perform OCR on PDFs, and not on the other files. With the default configuration for the default parser, OCR is enabled and processing gets slower. But I cannot find a combination of parameters that will allow me to disable OCR for the default parser and enable it for the PDF parser. Once it is disabled for the default parser, I seem unable to allow it to be used in the PDF parser.

I have tried excluding OCR from the default parser and defining the inline strategy in the PDF parser config. I have also tried excluding both the PDF parser and the Tesseract parser from the default parser config. Any advice, or will I have to process my PDFs against a differently configured Tika than the rest of my docs? Here is a sample of what I have tried.
<?xml version="1.0" encoding="UTF-8"?>

true

goodmansasha commented 4 years ago

The sample or code didn't come through. It sounds like you tried using the parser config file, which uses Tika's XML format. More help and documentation are here:

I suggest asking here: https://issues.apache.org/jira/projects/TIKA
Limited instructions: https://tika.apache.org/1.24.1/configuring.html
The Tika Wiki has a few pages on the config file: https://cwiki.apache.org/confluence/display/tika/

Unfortunately, I'm more familiar with R than with the Tika XML format, so I would try processing your PDFs with a differently configured Tika:

For example, in R, get one character vector of the paths to the PDF files, and another of the paths to the non-PDF files. There are several ways to do this; one is to call the base function list.files on a directory with a regular-expression pattern. You might also make sure they are full paths using normalizePath(input, winslash = "/").

Run a tika function on the non-PDF vector with the following argument: config = system.file("extdata", "no-ocr.xml", package = "rtika")

For the PDF vector, the default is to use OCR.

If you come up with a Tika XML config file solution that works, please share and I can include it in later versions.

grimlinda56 commented 4 years ago

Thanks for responding. Will let you know what I find out. Right now (I am using the Python library) I have two Tika servers running, one with OCR on and one with it off. I route PDF and TIF files to the first, and everything else to the second. Less than satisfactory, but it seems to work.
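
For anyone wanting to try the same workaround, here is a minimal sketch of the routing step, not the exact code in use. It assumes the tika-python package and two already-running Tika servers. The two endpoint URLs, the function names, and the inclusion of ".tiff" alongside ".tif" are illustrative assumptions rather than details from this thread.

```python
from pathlib import Path

# Hypothetical endpoints (assumptions): one Tika server started with OCR
# enabled, and one started with a config that disables OCR.
OCR_SERVER = "http://localhost:9998"
NO_OCR_SERVER = "http://localhost:9999"

# Extensions sent to the OCR-enabled server. PDF and TIF come from the
# description above; ".tiff" is an added assumption.
OCR_EXTENSIONS = {".pdf", ".tif", ".tiff"}


def pick_server(path):
    """Return the endpoint a file should be sent to, based on its extension."""
    if Path(path).suffix.lower() in OCR_EXTENSIONS:
        return OCR_SERVER
    return NO_OCR_SERVER


def extract_text(path):
    """Parse one file with tika-python against the server chosen for it."""
    # Imported lazily so pick_server can be used without tika installed.
    from tika import parser

    parsed = parser.from_file(str(path), serverEndpoint=pick_server(path))
    return parsed.get("content")
```

Routing on the file extension is the simplest choice, but it will misroute files whose extension does not match their content; checking the detected MIME type first would be more robust at the cost of an extra request per file.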

ropensci / rtika

Selectively using tesseract with tika #13