Ability to specify tessedit_pageseg_mode when using pdf_ocr_text

I have exactly the same issue and have posted my solution below. Unfortunately, I am fairly new to GitHub and don't yet know how to edit functions in R packages and contribute them to the official repo. I might look into this in the coming weeks or months and add a) the possibility to forward the options to tesseract::ocr and b) the pdf_ocr_text_2 function, with multiple columns as the default. Side note: the default value of tessedit_pageseg_mode is 6 ("block"); the two-column version uses 1 ("auto+osd"). The possible values can be found with tesseract::tesseract_params(filter = "tessedit_pageseg_mode").

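General solution:

    pdf_ocr_text <- function(pdf, pages = NULL, opw = "", upw = "", dpi = 600,
                             language = "eng", options = list()) {
      # Forward user-supplied Tesseract parameters to the OCR engine.
      engine <- tesseract::tesseract(language = language, options = options)
      # Render the requested pages to image files.
      images <- pdftools::pdf_convert(pdf = pdf, pages = pages, opw = opw, upw = upw, dpi = dpi)
      # Delete the temporary images when the function exits.
      on.exit(unlink(images))
      # OCR each image and return one string per page.
      vapply(images, tesseract::ocr, character(1), engine = engine, USE.NAMES = FALSE)
    }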

Solution for two columns:

    pdf_ocr_text_2 <- function(pdf, pages = NULL, opw = "", upw = "", dpi = 600,
                               language = "eng",
                               options = list(tessedit_pageseg_mode = 1)) {
      # Same as above, but defaults to page segmentation mode 1 ("auto+osd").
      engine <- tesseract::tesseract(language = language, options = options)
      images <- pdftools::pdf_convert(pdf = pdf, pages = pages, opw = opw, upw = upw, dpi = dpi)
      on.exit(unlink(images))
      vapply(images, tesseract::ocr, character(1), engine = engine, USE.NAMES = FALSE)
    }
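For anyone who only wants to switch the segmentation mode, here is a minimal sketch of the same idea with the mode exposed as a plain argument. The helper name ocr_with_psm and the file two_column.pdf are hypothetical, and the accepted range of 0 to 13 is an assumption based on recent Tesseract versions (older builds may list fewer modes), so check tesseract::tesseract_params() on your installation.

```r
# Hypothetical helper (not part of pdftools): OCR a PDF with an explicit
# page segmentation mode passed through to Tesseract.
ocr_with_psm <- function(pdf, psm = 1L, dpi = 600, language = "eng") {
  # Assumption: modes 0 to 13 are valid; reject anything else early.
  stopifnot(is.numeric(psm), length(psm) == 1, psm %in% 0:13)
  engine <- tesseract::tesseract(
    language = language,
    options = list(tessedit_pageseg_mode = as.integer(psm))
  )
  # Render pages to image files, and remove them when the function exits.
  images <- pdftools::pdf_convert(pdf, dpi = dpi)
  on.exit(unlink(images))
  # One string of recognised text per page.
  vapply(images, tesseract::ocr, character(1), engine = engine, USE.NAMES = FALSE)
}

pdf_path <- "two_column.pdf"  # assumed example file, replace with your own
if (file.exists(pdf_path) &&
    requireNamespace("pdftools", quietly = TRUE) &&
    requireNamespace("tesseract", quietly = TRUE)) {
  text <- ocr_with_psm(pdf_path, psm = 1)
  cat(text[1])  # print the OCR result for the first page
}
```

Validating the mode before creating the engine means a typo such as psm = 99 fails immediately, instead of after the pages have been rendered.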

ropensci / pdftools

Ability to specify tessedit_pageseg_mode when using pdf_ocr_text #121