Open milen-dimitrov opened 1 year ago
The memory leak issue has been reported several times, but we have no way to address it. The Java binding is just a thin Java layer over the Tesseract C API. The native code appears to be prone to memory leaks in multithreaded applications.
I encountered this memory leak a few weeks ago and managed to determine that it only occurs when doing parallel OCR processing using tess4j within a Docker container.
When running my container, the Java heap and native memory remain stable, but the RAM usage of the container keeps increasing.
To reproduce this leak, I iterate over PDF files and, for each PDF file, create a 4-thread pool: ExecutorService executor = Executors.newFixedThreadPool(4)
Each of the 4 threads processes one page at a time. For each page, a new Tesseract() instance is created and tesseract.doOCR(pageImage) is called to do the OCR. When the processing of the PDF file finishes, I close my thread pool using executor.shutdownNow()
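For readers who want the shape of this pattern without opening the repository, here is a minimal sketch of the per-PDF pool lifecycle described above. It is not the code from the sample project: the class name PerPdfPoolRepro, the method processPdf, and the ocr parameter are hypothetical, and the OCR call is passed in as a plain function so the sketch runs without tess4j. With tess4j, that function would presumably create a new Tesseract() per page and call doOCR(pageImage), handling the checked TesseractException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class PerPdfPoolRepro {

    // Processes the pages of one PDF with a pool created just for this PDF.
    // "ocr" is a hypothetical stand-in for the per-page OCR call.
    static <P> List<String> processPdf(List<P> pages, Function<P, String> ocr)
            throws Exception {
        // A fresh 4-thread pool per PDF, as in the report.
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (P page : pages) {
                Callable<String> task = () -> ocr.apply(page);
                futures.add(executor.submit(task));
            }
            // Collect results in page order.
            List<String> texts = new ArrayList<>();
            for (Future<String> future : futures) {
                texts.add(future.get());
            }
            return texts;
        } finally {
            // The pool is discarded after every PDF.
            executor.shutdownNow();
        }
    }
}
```

Calling processPdf once per PDF creates and destroys four worker threads each time, which is the lifecycle under which the growing container RAM usage was observed.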
I've managed to circumvent the leak by making my thread pool static and never shutting down my threads; I only reuse them. This doesn't lead to ever-increasing RAM usage, but I don't think recreating the thread pool and then shutting it down should be an issue.
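A minimal sketch of that workaround, under the same assumptions as before: SharedOcrPool, processPdf, and ocr are hypothetical names, and the OCR call is a stand-in function rather than a real tess4j call. The daemon thread factory is an addition of this sketch (not something from the report) so that the never-shut-down pool does not keep the JVM alive on exit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class SharedOcrPool {

    // One pool for the lifetime of the application; its threads are reused
    // across PDFs and never shut down.
    private static final ExecutorService POOL =
            Executors.newFixedThreadPool(4, runnable -> {
                Thread thread = new Thread(runnable, "ocr-worker");
                thread.setDaemon(true); // assumption: lets the JVM exit normally
                return thread;
            });

    // "ocr" is a hypothetical stand-in for the per-page OCR call.
    static <P> List<String> processPdf(List<P> pages, Function<P, String> ocr)
            throws Exception {
        List<Callable<String>> tasks = new ArrayList<>();
        for (P page : pages) {
            tasks.add(() -> ocr.apply(page));
        }
        // invokeAll blocks until all pages are done and returns the futures
        // in the same order as the submitted tasks.
        List<String> texts = new ArrayList<>();
        for (Future<String> future : POOL.invokeAll(tasks)) {
            texts.add(future.get());
        }
        return texts;
    }
}
```

Because the same four threads handle every PDF, no worker thread is ever created or destroyed after startup.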
If I run my code outside of the Docker container, there is no memory leak. If I run my code in the container but with only one thread, there is no memory leak either.
I made a git repository with a sample Java project to illustrate and reproduce the leak. Just build and run the Docker image: https://github.com/milen-dimitrov/TessMemoryLeakSample
There are also these messages that may mean something. I get them when I interrupt my program: https://github.com/milen-dimitrov/TessMemoryLeakSample/blob/main/Screenshot_20230408_200926.png?raw=true