tesseract very slow in R

ropensci / tesseract

Bindings to Tesseract OCR engine for R

Hi Randomgambit, I ran tesseract in parallel on Windows and it seems to perform pretty well. I tested a 47-page PDF both with and without parallel processing, and the function using parallel processing appears to be approximately 70% faster. I've included my code below.

Hope this is helpful!

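parallel_ocr <- function(x) {
  # Split the PDF into one file per page; ./images/split/ is assumed to exist
  pdf_pages <- as.list(pdftools::pdf_split(x, "./images/split/"))

  # One worker per core; always shut the cluster down when the function exits
  cl <- parallel::makeCluster(parallel::detectCores())
  on.exit(parallel::stopCluster(cl), add = TRUE)
  # Load both packages on every worker
  parallel::clusterEvalQ(cl, {library(pdftools); library(tesseract)})

  # Render each page to PNG at 150 dpi, then OCR the images, load-balanced
  png_files <- parallel::parLapplyLB(cl, pdf_pages, pdftools::pdf_convert, dpi = 150)
  text <- parallel::parLapplyLB(cl, png_files, tesseract::ocr)

  # Return the list of recognized text, one element per page
  text
}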
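
For anyone who wants to check the speedup on their own machine, here is a rough sketch of how a sequential run and a parallel run could be timed side by side. It is not the benchmark I ran: `slow_task` is a hypothetical stand-in for the per-page OCR work, and the two-worker cluster and eight fake "pages" are arbitrary assumptions so the example runs without a PDF.

```r
library(parallel)

# Hypothetical stand-in for per-page OCR work; swap in the real call
slow_task <- function(i) {
  Sys.sleep(0.2)
  i * 2
}

# Eight fake "pages" (assumption, just to have something to iterate over)
pages <- as.list(1:8)

# Sequential baseline
t_seq <- system.time(res_seq <- lapply(pages, slow_task))[["elapsed"]]

# Parallel run on two workers, load-balanced like parLapplyLB() in parallel_ocr
cl <- makeCluster(2)
t_par <- system.time(res_par <- parLapplyLB(cl, pages, slow_task))[["elapsed"]]
stopCluster(cl)

# Both runs should return the same results; only the elapsed time should differ
print(identical(res_seq, res_par))
print(c(sequential = t_seq, parallel = t_par))
```

To time the real thing, replace `slow_task` with the OCR step and `pages` with the split page files. The gain will depend on the number of cores and on how long each page takes compared with the cost of starting the workers.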

tesseract very slow in R #54