Open randomgambit opened 3 years ago
Hi Randomgambit, I have run tesseract in parallel on Windows and it seems to perform pretty well. I tested a 47-page PDF both with and without parallel processing. The function using parallel processing appears to be approximately 70% faster. I've included my code below.
Hope this is helpful!
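library(parallel)
parallel_ocr <- function(x) {
  # Split the PDF into single-page files so each worker can take one page
  pdf_pages <- as.list(pdftools::pdf_split(x, "./images/split/"))
  cl <- makeCluster(detectCores())
  on.exit(stopCluster(cl), add = TRUE)  # shut the workers down, even on error
  # Load the packages on every worker (this makes clusterExport unnecessary)
  clusterEvalQ(cl, {library(pdftools); library(tesseract)})
  # Render each page to PNG at 150 dpi, then OCR each image, load-balanced
  png_files <- parLapplyLB(cl, pdf_pages, pdftools::pdf_convert, dpi = 150)
  text <- parLapplyLB(cl, png_files, tesseract::ocr)
  text  # return the list of recognised text visibly
}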
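For anyone who wants to repeat the with/without comparison on their own machine, a rough timing harness might look like the sketch below. It is only an illustration: `compare_timings` and `slow_square` are hypothetical names that do not come from the code above or from any package, and the stand-in worker just sleeps so the sketch runs without a PDF. Cluster start-up time is deliberately left out of the measurement.

```r
library(parallel)

# Hypothetical helper: run `worker` over `inputs` sequentially and then on a
# local cluster, and report the elapsed time (in seconds) of each run.
compare_timings <- function(inputs, worker, n_workers = 2) {
  sequential_s <- system.time(seq_res <- lapply(inputs, worker))[["elapsed"]]

  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl), add = TRUE)  # always release the workers
  parallel_s <- system.time(
    par_res <- parLapplyLB(cl, inputs, worker)
  )[["elapsed"]]

  list(
    sequential_s = sequential_s,
    parallel_s = parallel_s,
    identical_results = identical(seq_res, par_res)
  )
}

# Stand-in worker (an assumption for illustration): sleeps briefly, then
# returns the square of its input. Swap in tesseract::ocr and a list of
# page images to time the real workload.
slow_square <- function(i) {
  Sys.sleep(0.2)
  i^2
}

res <- compare_timings(as.list(1:8), slow_square)
print(res)
```

How much the parallel run gains will depend on the number of cores, the page count, and how long each page takes, so the figures from a toy worker like this say nothing about OCR itself.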
Hello there,
Thanks for this amazing binding! I am running into some performance issues and I wonder if you have some hints or ideas.
Basically, the R wrapper works fine, but it is very slow. I tried to use furrr and multiprocessing, but I have read on the internet that it is not that easy to run many tesseract processes in parallel. Is that true? Have you been able to run tesseract in parallel already?
Thanks!