ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools

Ability to specify tessedit_pageseg_mode when using pdf_ocr_text #121

Closed werkstattcodes closed 1 year ago

werkstattcodes commented 1 year ago

This is more a suggestion than an issue: it would be very handy if pdf_ocr_text provided an option to set tesseract's tessedit_pageseg_mode. Unless I am mistaken, this is currently not possible, so when a specific value is needed, one has to first use pdftools::pdf_convert and then tesseract::ocr(., tessedit_pageseg_mode=..). While these steps can easily be wrapped in one function, it would be nice if pdf_ocr_text provided the option 'out of the box', particularly since the dpi and language values are already forwarded. Thank you.

nriemenschneider commented 1 year ago

I have exactly the same issue and have posted my solution below. Unfortunately, I am fairly new to GitHub and don't yet know how to edit functions in R packages and contribute them to the official repo. I might look into this in the coming weeks or months and add a) the possibility to forward options to tesseract::ocr and b) a pdf_ocr_text_2 function with multi-column handling as the default. Sidenote: the default value of tessedit_pageseg_mode is "6=block"; for multiple columns I use "1=auto+osd". The possible values can be found with tesseract::tesseract_params(filter = "tessedit_pageseg_mode").
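To make the sidenote concrete, here is a minimal sketch of a lookup from mode names to tessedit_pageseg_mode values. The numbers follow Tesseract's page segmentation modes as found in recent versions. The names psm_modes and psm_options are hypothetical and not part of pdftools or tesseract, so check the values against tesseract_params for your installation.

```r
# Page segmentation modes for tessedit_pageseg_mode.
# The labels are illustrative names chosen for this sketch.
psm_modes <- c(
  osd_only = 0, auto_osd = 1, auto_only = 2, auto = 3,
  single_column = 4, single_block_vert_text = 5, single_block = 6,
  single_line = 7, single_word = 8, circle_word = 9, single_char = 10,
  sparse_text = 11, sparse_text_osd = 12, raw_line = 13
)

# Hypothetical helper: turn a mode name into an options list that can be
# passed as `options` to tesseract::tesseract().
psm_options <- function(mode) {
  if (!is.character(mode) || length(mode) != 1 || !mode %in% names(psm_modes)) {
    stop("unknown page segmentation mode; see names(psm_modes)")
  }
  list(tessedit_pageseg_mode = psm_modes[[mode]])
}

# Usage (requires the tesseract package, so left commented out):
# engine <- tesseract::tesseract("eng", options = psm_options("auto_osd"))
```

Naming the modes avoids passing a bare integer whose meaning is easy to forget.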

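General solution:

    pdf_ocr_text <- function(pdf, pages = NULL, opw = "", upw = "", dpi = 600,
                             language = "eng", options = list()) {
      # Build an engine that carries the extra options (e.g. tessedit_pageseg_mode)
      engine <- tesseract::tesseract(language = language, options = options)
      # Render the requested pages to image files
      images <- pdftools::pdf_convert(pdf = pdf, pages = pages, opw = opw, upw = upw, dpi = dpi)
      # Remove the temporary image files when the function exits
      on.exit(unlink(images))
      # OCR each page image and return one string per page
      vapply(images, tesseract::ocr, character(1), engine = engine, USE.NAMES = FALSE)
    }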

Solution for two columns:

    pdf_ocr_text_2 <- function(pdf, pages = NULL, opw = "", upw = "", dpi = 600,
                               language = "eng",
                               options = list(tessedit_pageseg_mode = 1)) {
      # Same as the general solution, but defaults to page segmentation mode 1 (auto+osd)
      engine <- tesseract::tesseract(language = language, options = options)
      images <- pdftools::pdf_convert(pdf = pdf, pages = pages, opw = opw, upw = upw, dpi = dpi)
      # Remove the temporary image files when the function exits
      on.exit(unlink(images))
      vapply(images, tesseract::ocr, character(1), engine = engine, USE.NAMES = FALSE)
    }
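One thing to watch with the two-column variant: if a caller passes their own options list, it replaces the default entirely, so tessedit_pageseg_mode = 1 is lost unless it is repeated. A possible way around this, sketched here with a hypothetical helper merge_ocr_options (not part of pdftools or tesseract), is to merge the caller's options into the defaults with utils::modifyList:

```r
# Hypothetical helper: merge caller-supplied tesseract options into a set of
# defaults. Caller values win; defaults fill in anything not specified.
merge_ocr_options <- function(options = list(),
                              defaults = list(tessedit_pageseg_mode = 1)) {
  stopifnot(is.list(options), is.list(defaults))
  utils::modifyList(defaults, options)
}

# The two-column default is kept when other options are supplied:
str(merge_ocr_options(list(preserve_interword_spaces = 1)))

# An explicit value still overrides the default:
str(merge_ocr_options(list(tessedit_pageseg_mode = 6)))
```

Inside a wrapper, the merged list would then be passed on as the options argument of tesseract::tesseract().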