ropensci / rtika

R Interface to Apache Tika
https://docs.ropensci.org/rtika
Apache License 2.0
54 stars, 8 forks

Selectively using tesseract with tika #13

Closed grimlinda56 closed 3 years ago

grimlinda56 commented 4 years ago

I am trying to process a directory of diverse file types using Tika. Some are PDFs that require OCR. I ONLY want to perform OCR on PDFs, and not on the other files. With the default configuration, OCR is enabled in the default parser and processing gets slower. But I cannot find a combination of parameters that disables OCR for the default parser and enables it for the PDF parser. Once OCR is disabled for the default parser, I seem unable to get the PDF parser to use it.

I have tried excluding OCR from the default parser and defining the inline strategy in the PDF parser config. I have also excluded both the PDF parser and the Tesseract parser from the default parser config. Any advice, or will I have to process my PDFs against a differently configured Tika than the rest of my docs? Here is a sample of what I have tried.
<?xml version="1.0" encoding="UTF-8"?>

true

goodmansasha commented 4 years ago

The sample code didn't come through. It sounds like you tried using the parser config file, which uses Tika's XML format. More help and documentation are here:

Unfortunately, I'm more familiar with R than with the Tika XML format, so I would try processing your PDFs with a differently configured Tika:

For example, in R, build one character vector of the paths to the PDF files and another of the paths to the non-PDF files. There are several ways to do this; one is to call the base function list.files on the directory with a regular expression as the pattern argument. You might also make sure they are full paths using normalizePath(input, winslash = "/").

Run a tika function on the non-PDF vector with the following argument: config = system.file("extdata", "no-ocr.xml", package = "rtika")

For the PDF vector, leave config at its default, which uses OCR.

If you come up with a Tika XML config file solution that works, please share and I can include it in later versions.

grimlinda56 commented 4 years ago

Thanks for responding. I will let you know what I find out. Right now (I am using the Python library) I have two Tika servers running, one with OCR on and one with it off. I route PDF and TIFF files to the first, and everything else to the second. Less than satisfactory, but it seems to work.
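
The two-server routing described above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the two endpoint URLs and ports, the extension list, and the helper names pick_endpoint and extract are hypothetical and do not come from the thread. It also assumes the tika Python package, whose parser.from_file accepts a serverEndpoint argument, and that each server was already started with its own OCR setting.

```python
from pathlib import Path

# Hypothetical endpoints (assumption): one Tika server started with OCR
# enabled and one started with OCR disabled. Adjust hosts and ports.
OCR_ENDPOINT = "http://localhost:9998"
NO_OCR_ENDPOINT = "http://localhost:9999"

# Extensions routed to the OCR-enabled server (PDF and TIFF, as in the
# comment above). The exact set is an assumption.
OCR_EXTENSIONS = {".pdf", ".tif", ".tiff"}


def pick_endpoint(path):
    """Return the server URL for a file, chosen by its extension."""
    suffix = Path(path).suffix.lower()
    return OCR_ENDPOINT if suffix in OCR_EXTENSIONS else NO_OCR_ENDPOINT


def extract(path):
    """Parse one file on the server selected by pick_endpoint."""
    # Imported lazily so the routing logic works without tika installed.
    from tika import parser

    return parser.from_file(str(path), serverEndpoint=pick_endpoint(path))
```

Calling extract on each file in the directory then sends scanned documents to the OCR server and everything else to the faster one. Routing by extension is a simplification: a file with a misleading or missing extension would go to the non-OCR server.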