Closed by edavidaja 5 years ago
Thanks for this! I’ll sort this out for the next release.
Hi, I'd like to add that I am having the same problem as edavidaja, but running his tika-config.xml code did not solve it. I am new to R and have no Java experience. I suspect the issue is that I do not have Tika installed (I only have your rtika package installed).
What command or syntax do I enter to 'force' rtika to do OCR?
After looking into this, I think the OCR only works on Linux. I've tried on OS X and Windows without success.
@edavidaja I think this config file should be the default. Apparently, very large PDF files can cause memory problems with tesseract, which is why OCR was switched off. However, OCR should be consistently turned on. I've made PDF OCR the default behavior, and also included a config file that can be used instead to turn OCR off. This is in the latest version of rtika here.
devtools::install_github("ropensci/rtika")
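For anyone wanting to switch OCR on or off explicitly, here is a minimal sketch. It assumes that `tika_text()` forwards a `config` argument to `tika()`, and that the bundled config files are named `ocr.xml` and `no-ocr.xml` under `extdata`. Those file names are my assumption, so check your installed version. `extract_pdf_text` and `scanned.pdf` are hypothetical names used only for illustration.

```r
# Sketch: run rtika on one PDF with OCR explicitly switched on or off.
# Assumes rtika is installed, install_tika() has been run once, and that
# Java and tesseract are available on the PATH.
extract_pdf_text <- function(path, ocr = TRUE) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  # Assumed names of the config files shipped with rtika.
  config_name <- if (ocr) "ocr.xml" else "no-ocr.xml"
  config_path <- system.file("extdata", config_name, package = "rtika")
  # system.file() returns "" when the file or the package is missing.
  if (!nzchar(config_path)) {
    stop("Config file not found in the installed rtika: ", config_name)
  }
  # Extra arguments such as `config` are passed through to tika().
  rtika::tika_text(path, config = config_path)
}
```

Calling `extract_pdf_text("scanned.pdf")` would then request OCR, and `extract_pdf_text("scanned.pdf", ocr = FALSE)` would skip it, which may help with very large PDFs that strain tesseract.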
This may very well be a "works on my machine" situation, but I'm finding that some config is required to get Tika to OCR as part of the batch process, even after tesseract is installed.
Created on 2018-12-14 by the reprex package (v0.2.1)