Closed grimlinda56 closed 3 years ago
The sample or code didn't come through. It sounds like you tried using the parser config file, which uses Tika's XML format. More help and documentation are here:
Unfortunately I'm more familiar with R than with the Tika XML format, so I would try processing your PDFs with a differently configured Tika:
For example, in R, build one character vector of the paths to the PDF files and another of the paths to the non-PDF files. There are several ways to do this; one is the base function list.files on a directory with a pattern regular expression. You might also make sure they are full paths using normalizePath(input, winslash = "/").
Run a tika function on the non-PDF vector with the following argument:
config = system.file("extdata", "no-ocr.xml", package = "rtika")
For the PDF vector, the default is to use OCR.
If you come up with a Tika XML config file solution that works, please share and I can include it in later versions.
Thanks for responding. I will let you know what I find out. Right now (I am using the Python library) I have two Tika servers running, one with OCR on and one with it off. I route PDF and TIFF files to the first, and everything else to the second. Less than satisfactory, but it seems to work.
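For anyone wanting to try the same workaround, here is a minimal sketch of the routing step in Python. It only decides which server a file should go to, based on its extension. The two endpoint URLs, the extension set, and the function names pick_endpoint and route are assumptions made up for illustration, not anything Tika or the Python library defines.

```python
from pathlib import Path

# Hypothetical endpoints: adjust to wherever your two Tika servers listen.
OCR_SERVER = "http://localhost:9998"     # server configured with OCR enabled
NO_OCR_SERVER = "http://localhost:9999"  # server configured with OCR disabled

# Extensions sent to the OCR-enabled server (PDF and TIFF, as described above).
OCR_EXTENSIONS = {".pdf", ".tif", ".tiff"}


def pick_endpoint(path):
    """Return the server URL for a file, based only on its extension."""
    suffix = Path(path).suffix.lower()
    return OCR_SERVER if suffix in OCR_EXTENSIONS else NO_OCR_SERVER


def route(paths):
    """Group file paths by the server that should parse them."""
    groups = {OCR_SERVER: [], NO_OCR_SERVER: []}
    for p in paths:
        groups[pick_endpoint(p)].append(str(p))
    return groups
```

With the tika Python package, each file could then presumably be parsed with something like parser.from_file(path, serverEndpoint=pick_endpoint(path)). Matching on the extension is the simplest choice here; it will misroute files whose extension does not reflect their real type.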
I am trying to process a directory of diverse file types using tika. Some are PDFs that require OCR. I ONLY want to perform OCR on PDFs, and not the other files. Using the default configuration for the default parser, OCR is enabled and things get slower. But I cannot find a combination of parameters that will allow me to disable OCR for the default parser and enable it for the PDF parser. Once it is disabled for the default parser, I seem unable to allow it to be used in the PDF parser.
I have tried excluding OCR from the default parser and defining the inline strategy in the pdf parser config. I have excluded both the PDF parser and the tesseract parser from the default parser config. Any advice, or will I have to process my PDFs against a differently configured tika than the rest of my docs? Here is a sample of what I have tried.
<?xml version="1.0" encoding="UTF-8"?>