ropensci / tesseract

Bindings to Tesseract OCR engine for R
https://docs.ropensci.org/tesseract

Memory leak using ocr_data method #37

Closed JoelSPendery closed 5 years ago

JoelSPendery commented 5 years ago

I am reading in a number of PDFs, pre-processing each one, and then extracting text with the ocr_data() method (from a temporary PNG file). Inside the loops that iterate over each PDF and over all the PDFs in a directory, I remove unused variables and run garbage collection (gc()) to keep the memory load of my program down. However, as I read in more and more PDFs, the memory load on my system keeps increasing until it eventually runs out of memory and throws an error.

If I load the library but comment out the ocr_data() call in my script, the memory load does not appear to increase (the minimum load stays fairly consistent over a time period in which I had previously observed an increase). If I replace ocr_data() with ocr(), I also see no increase in memory load. This suggests that whatever is causing the issue originates in the ocr_data() method: probably something is not being deleted on the C++ side, perhaps related to storing the data frame values that are returned, but that is just a guess.
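For reference, a minimal sketch of the kind of loop described above; the directory path and the page-rendering step are hypothetical, and it assumes the pdftools package alongside tesseract:

```r
# Hypothetical reproduction sketch: iterate over PDFs in a directory,
# render pages to temporary PNGs, and run ocr_data() on each one.
library(pdftools)
library(tesseract)

pdf_files <- list.files("C:/docs", pattern = "\\.pdf$", full.names = TRUE)

for (pdf in pdf_files) {
  # Render each page of the PDF to a PNG file
  pngs <- pdf_convert(pdf, format = "png", dpi = 300)
  for (png in pngs) {
    words <- ocr_data(png)  # memory grows across iterations; ocr() does not show this
    # ... process or store `words` here ...
    file.remove(png)
  }
  rm(words, pngs)
  gc()  # explicit garbage collection does not reclaim the leaked memory
}
```

With ocr() substituted for ocr_data(), the same loop reportedly holds a steady memory footprint, which is what points the finger at the data-frame path.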

JoelSPendery commented 5 years ago

I found a work-around to this issue. Basically, I'm running a batch file within the script that kicks off a new instance of the script after I complete the OCR of a document. Immediately after kicking off the batch file from within the script, I run a quit statement to the previous instance of the script which releases the memory. It's not elegant, nor does it fix the root cause, but it works.

shell('C:/location/of/bat/file', wait = FALSE, translate = TRUE)
q()

Batch file: "C:\Program Files\R\R-3.5.1\bin\Rscript.exe" "C:\location\of\r\script.R"

You will need to have some way that you search (and continue) through the documents that have already been run through OCR.
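One way to sketch that bookkeeping, assuming a hypothetical log file (processed.txt) that records finished documents so each fresh R instance can skip them:

```r
# Hypothetical restart/checkpoint sketch for the workaround above.
# Each run OCRs one unprocessed PDF, logs it, relaunches via the batch
# file, then quits so Windows reclaims the leaked memory.
done_log <- "C:/docs/processed.txt"
done <- if (file.exists(done_log)) readLines(done_log) else character()

pdf_files <- list.files("C:/docs", pattern = "\\.pdf$", full.names = TRUE)
remaining <- setdiff(pdf_files, done)

if (length(remaining) > 0) {
  next_pdf <- remaining[1]
  # ... pre-process and OCR next_pdf here ...
  cat(next_pdf, "\n", file = done_log, append = TRUE, sep = "")
  # Kick off a new instance, then quit this one to release the memory
  shell('C:/location/of/bat/file', wait = FALSE, translate = TRUE)
  q(save = "no")
}
```

The shell() call and batch-file path follow the workaround above; everything else (log file name, directory) is illustrative.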

Oneiricer commented 5 years ago

Hey @JoelSPendery, looks like your solution works. I like how you literally thought 'outside the box'. I ran your method using batch files and saw that the quit command releases the memory. Rerunning the script makes it start from zero again. Not elegant, but it works, and that's fine by me.

Gonna try running this overnight now and see how it deals with 30+ PDFs. Thanks once again. Oneiricer