ropensci / tesseract

Bindings to Tesseract OCR engine for R
https://docs.ropensci.org/tesseract

Mass Converting PDF Files into Text #56

Closed swaheera closed 3 years ago

swaheera commented 3 years ago

I am trying to "mass upload" a large number of PDF files (these are scanned PDFs) and convert them into text. So far, I have figured out how to do this manually:

library(pdftools)
library(tesseract)

# import and convert 1st file
pngfile_1 <- pdftools::pdf_convert('myfile_1.pdf', dpi = 600)
text_1 <- tesseract::ocr(pngfile_1)

# import and convert 2nd file (note: the files do not have the same naming convention)
pngfile_2 <- pdftools::pdf_convert('second_file.pdf', dpi = 600)
text_2 <- tesseract::ocr(pngfile_2)

etc

I copied/pasted the above code 50 times (while changing the "index", i.e. pngfile_i, text_i) and was able to accomplish what I wanted to do.

However, I am looking for a more "automatic" way to import and convert all the PDF files.

At the moment, all my pdf files are in the following folder:

"C:/Users/me/Documents/mypdfs"

I found the following code, which can be used to "mass import" pdf files into R:

library(dplyr)
library(data.table)

tbl_fread <- 
    list.files(pattern = "*.pdf") %>% 
    map_df(~fread(.))

But I am not sure how to instruct this code to import all PDFs from the correct directory ("C:/Users/me/Documents/mypdfs"). I also don't know how to instruct R to "rename" each imported PDF as "pdf_1, pdf_2, etc."

If all the pdf files were correctly imported and created, I could then write a "loop" and execute the desired commands, e.g.


# "n" would be the total number of pdf files 

for (i in 1:n) {
  pngfile_i <- pdftools::pdf_convert('myfile_i.pdf', dpi = 600)
  text_i <- tesseract::ocr(pngfile_i)
}

Can someone please show me how to do this?

Thanks

jeroen commented 3 years ago

You don't need to use pdf_convert(); you can also just pass the PDF file directly to tesseract::ocr(). So maybe you can do:

files <- list.files("C:/Users/me/Documents/mypdfs", pattern = "\\.pdf$", full.names = TRUE)
out <- lapply(files, tesseract::ocr)
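
Since the files do not share a naming convention, it may help to label each result with the name of the file it came from rather than with a running index. The following is only a rough sketch of that idea: ocr_folder and its ocr_fun argument are hypothetical names made up for this illustration, not part of the tesseract package, and it assumes the OCR function returns one string per page.

```r
# Hypothetical helper (not part of tesseract): OCR every PDF in a folder
# and return a list of texts named after the source files.
ocr_folder <- function(dir, ocr_fun = tesseract::ocr) {
  # Match files ending in ".pdf" (case-insensitive) and keep full paths
  files <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE,
                      ignore.case = TRUE)

  # Run the OCR function on each file; join the per-page strings
  # (assumed to be one string per page) into a single text per file
  out <- lapply(files, function(f) paste(ocr_fun(f), collapse = "\n"))

  # Name each element after its file, without directory or extension
  names(out) <- tools::file_path_sans_ext(basename(files))
  out
}
```

With this, texts <- ocr_folder("C:/Users/me/Documents/mypdfs") would give a list where, for example, texts[["second_file"]] holds the text of second_file.pdf. Passing a different function as ocr_fun (for instance one that first calls pdftools::pdf_convert with dpi = 600, as in the original post) is one way to keep the explicit conversion step.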

Thanks for your question. This tracker is for reporting bugs and issues with the R package. General programming questions on how to write loops are better suited for Stack Overflow.