werkstattcodes closed this issue 1 year ago
I have exactly the same issue, so I am posting my solution below. Unfortunately, I am fairly new to GitHub and don't yet know how to edit functions in R packages and contribute them to the official repo. I might look into this in the coming weeks or months and add a) the possibility to forward options to tesseract::ocr and b) a pdf_ocr_text_2 function with multi-column handling as the default. Sidenote: the default value of tessedit_pageseg_mode is 6 ("block"); the value 1 is "auto+osd". The possible values can be found with tesseract::tesseract_params(filter = "tessedit_pageseg_mode").
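To make the sidenote concrete, here is a small sketch of looking the parameter up. It assumes the tesseract package is installed (hence the guard) and that the returned data frame has the columns param, default and desc, which is how I remember it:

```r
# Look up the page segmentation parameter, if the tesseract package is available.
if (requireNamespace("tesseract", quietly = TRUE)) {
  psm <- tesseract::tesseract_params(filter = "tessedit_pageseg_mode")
  # One row per matching parameter; the description lists the allowed values.
  print(psm)
}
```
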
General solution:
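pdf_ocr_text <- function(pdf, pages = NULL, opw = "", upw = "", dpi = 600,
                         language = "eng", options = list()) {
  # Build the OCR engine, forwarding any Tesseract options supplied by the caller
  engine <- tesseract::tesseract(language = language, options = options)
  # Render the requested pages to image files
  images <- pdftools::pdf_convert(pdf = pdf, pages = pages, opw = opw, upw = upw, dpi = dpi)
  # Delete the rendered image files when the function exits
  on.exit(unlink(images))
  # OCR each image; the result is one string per page
  vapply(images, tesseract::ocr, character(1), engine = engine, USE.NAMES = FALSE)
}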
Solution for two columns:
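pdf_ocr_text_2 <- function(pdf, pages = NULL, opw = "", upw = "", dpi = 600,
                           language = "eng",
                           options = list(tessedit_pageseg_mode = 1)) {
  # Same as pdf_ocr_text, but defaults to page segmentation mode 1 (auto+osd)
  engine <- tesseract::tesseract(language = language, options = options)
  # Render the requested pages to image files
  images <- pdftools::pdf_convert(pdf = pdf, pages = pages, opw = opw, upw = upw, dpi = dpi)
  # Delete the rendered image files when the function exits
  on.exit(unlink(images))
  # OCR each image; the result is one string per page
  vapply(images, tesseract::ocr, character(1), engine = engine, USE.NAMES = FALSE)
}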
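Since the two functions above differ only in their default options, one could alternatively keep a single function and merge a set of defaults with whatever the caller passes. A minimal sketch in base R; merge_ocr_options is a hypothetical helper name of mine, not part of pdftools or tesseract:

```r
# Hypothetical helper: combine default Tesseract options with user-supplied ones.
# User-supplied values win when both lists set the same parameter.
merge_ocr_options <- function(options = list(),
                              defaults = list(tessedit_pageseg_mode = 1)) {
  utils::modifyList(defaults, options)
}

# No user options: the multi-column default is kept.
opts_default <- merge_ocr_options()
# The caller overrides the default page segmentation mode.
opts_custom <- merge_ocr_options(list(tessedit_pageseg_mode = 6))

str(opts_default)
str(opts_custom)
```

The merged list could then be passed as the options argument of pdf_ocr_text.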
This is more a suggestion than an issue: it would be very handy if pdf_ocr_text provided an option to set tesseract's tessedit_pageseg_mode. Unless I am mistaken, this is currently not possible, so when a specific value is needed, one has to first use pdftools::pdf_convert and then tesseract::ocr(., tessedit_pageseg_mode=..). While these steps can easily be wrapped in one function, it would be nice if pdf_ocr_text provided the option 'out of the box', particularly since the dpi and language values are already forwarded. Thank you.