ropensci / tesseract

Bindings to Tesseract OCR engine for R
https://docs.ropensci.org/tesseract

tesseract very slow in R #54

Open randomgambit opened 3 years ago

randomgambit commented 3 years ago

Hello there,

Thanks for this amazing binding! I am running into some performance issues and I wonder if you have some hints or ideas.

Basically, the R wrapper works fine but it is very slow. I tried to use furrr and multiprocessing, but I have read on the internet that it is not that easy to run many tesseract processes in parallel. Is that true? Have you been able to run tesseract in parallel?

Thanks~

morgan-dgk commented 2 years ago

Hi Randomgambit, I have run tesseract in parallel on Windows and it seems to perform pretty well. I tested a 47-page PDF both with and without parallel processing, and the function using parallel processing was approximately 70% faster. I've included my code below.

Hope this is helpful!

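library(parallel)

parallel_ocr <- function(x) {
  # Split the PDF into one single-page PDF per page
  pdf_pages <- as.list(pdftools::pdf_split(x, "./images/split/"))

  # Start one worker per core; stop the cluster when the function exits
  cl <- makeCluster(detectCores())
  on.exit(stopCluster(cl), add = TRUE)
  clusterEvalQ(cl, {library(pdftools); library(tesseract)})

  # Render each page to PNG, then OCR each PNG, load-balanced across workers
  png_files <- parLapplyLB(cl, pdf_pages, pdftools::pdf_convert, dpi = 150)
  text <- parLapplyLB(cl, png_files, tesseract::ocr)

  text
}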
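
If you want to check whether parallel processing pays off on your own machine, a rough timing comparison could look like the sketch below. Note that `fake_ocr` is a hypothetical stand-in that just sleeps for a moment (it is not part of tesseract), so the snippet runs without any PDF or OCR engine installed. The page count and the two-worker cluster are arbitrary choices for illustration. Swap in your real per-page OCR function to get meaningful numbers.

```r
library(parallel)

# Hypothetical stand-in for OCR on one page: sleeps briefly, then returns a string.
fake_ocr <- function(page) {
  Sys.sleep(0.2)
  paste("text of page", page)
}

pages <- as.list(1:8)

# Sequential baseline
seq_time <- system.time(seq_result <- lapply(pages, fake_ocr))[["elapsed"]]

# Parallel run on two workers (kept small so it runs on most machines)
cl <- makeCluster(2)
par_time <- system.time(par_result <- parLapplyLB(cl, pages, fake_ocr))[["elapsed"]]
stopCluster(cl)

# Both approaches should return the same results in the same order
stopifnot(identical(seq_result, par_result))
cat(sprintf("sequential: %.2fs, parallel: %.2fs\n", seq_time, par_time))
```

The cluster startup is deliberately left outside the timed section. For short jobs that startup cost can eat most of the gain, so it is worth timing it separately on your setup.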