nguyenq / tess4j

Java JNA wrapper for Tesseract OCR API
Apache License 2.0

Bad performance compared with direct use of Tesseract #225

Closed cesc6 closed 1 year ago

cesc6 commented 2 years ago

Hi, I'm getting bad performance using Tess4j in comparison with direct use of Tesseract on the same machine with the same resources.

I'm using Tess4j-5.1.1 and I have Tesseract v5.0.0-alpha.20210811 installed on my PC (Windows 10, i7-7600U CPU @ 2.80GHz 2.90 GHz, 16GB RAM).

Here is what I'm doing:

public static void main(String[] args) throws Exception {
        long t0, tf;

        File scannedPdf = new File("C:\\Users\\francesc.sola\\Desktop\\work_tesseract\\1_page.jpg");
        ITesseract instance = new Tesseract();  // JNA Interface Mapping
        //ITesseract instance = new Tesseract1(); // JNA Direct Mapping

        System.out.println("Using Tess4j-5.1.1");
        t0 = System.currentTimeMillis();
        instance.doOCR(scannedPdf);
        tf = System.currentTimeMillis();
        System.out.println("Process time: " + (tf - t0) + " ms.");

        System.out.println("Direct call to tesseract v5.0.0-alpha.20210811");
        String command = "tesseract.exe C:\\Users\\francesc.sola\\Desktop\\work_tesseract\\1_page.jpg C:\\Users\\francesc.sola\\Desktop\\work_tesseract\\out";

        // Running the above command
        Runtime run = Runtime.getRuntime();
        t0 = System.currentTimeMillis();
        Process proc = run.exec(command);
        proc.waitFor();
        tf = System.currentTimeMillis();

        System.out.println("Process time: " + (tf - t0) + " ms.");
        run.exit(0);
    }

And here is the output:

Using Tess4j-5.1.1
Process time: 11010 ms.
Direct call to tesseract v5.0.0-alpha.20210811
Process time: 6589 ms.

As you can see, using Tess4j considerably increases the processing time compared with a direct call to Tesseract.

Any ideas about this behaviour? I attached the image I'm testing with.

[Attached image: 1_page]

Thanks!

nguyenq commented 2 years ago

Would Tesseract1 API provide higher speeds? Either way, going through JNA would incur some overhead.

But it is probably mainly because the DLL was compiled without Enhanced Instruction Set enabled. This was done to maintain maximum compatibility across several generations of CPUs. You can build a DLL with Enhanced Instruction Set enabled to match your CPU's capability and set the jna.library.path variable to load that instead.
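As a minimal sketch of the jna.library.path suggestion: the helper below sets the property to a directory holding a custom build. The class name, method name and default directory are made up for illustration. The property has to be set before the first Tess4J or JNA class is loaded, and this sketch does not check that the directory actually contains a matching DLL.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CustomNativeLibraryPath {

    /**
     * Points JNA at a directory that holds a custom-built native library.
     * Call this before any JNA-mapped class is loaded.
     * Returns the absolute path that was stored in the property.
     */
    public static String useNativeDir(Path dir) {
        if (!Files.isDirectory(dir)) {
            throw new IllegalArgumentException("Not a directory: " + dir);
        }
        String value = dir.toAbsolutePath().toString();
        System.setProperty("jna.library.path", value);
        return value;
    }

    public static void main(String[] args) {
        // Hypothetical location of the custom build; defaults to the working directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println("jna.library.path = " + useNativeDir(dir));
        // After this point, create the Tess4J instance as usual.
    }
}
```

The same property can be passed on the command line with -Djna.library.path=<directory> instead of being set in code.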

nguyenq commented 2 years ago

Related https://github.com/nguyenq/tess4j/issues/95

cesc6 commented 2 years ago

If I run the program through the Tesseract1 API the results are better, but only by 1500-3000 ms. It still doesn't match the direct use of Tesseract. I will try to compile the DLL as you say and run some tests. Thanks for guiding me.

cesc6 commented 2 years ago

I spent some time trying to compile the DLL with Enhanced Instruction Set enabled, but I was not able to build it. Still, I don't understand why the problem would be in the Tesseract source if the latency appears when I use the Tess4j lib. In this case I'm using the same Tesseract version and configuration through Tess4j and through the direct call. Could you check this?

Thanks.

nguyenq commented 2 years ago

Let me see if I can generate one with Enhanced Instruction Set enabled, and then you can help test it.

nguyenq commented 2 years ago

With Advanced Vector Extensions 2 (/arch:AVX2) enabled, a 5% improvement in speed was observed.

cesc6 commented 2 years ago

Thanks @nguyenq for the tests. A 5% improvement is small relative to the times I'm getting in my tests, don't you think?

Using Tess4j-5.1.1
Process time: 11010 ms.
Direct call to tesseract v5.0.0-alpha.20210811
Process time: 6589 ms.
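To put the 5% figure next to the numbers above, here is a small sketch of the arithmetic. It assumes, purely for illustration, that the 5% measured on another machine would carry over unchanged to the 11010 ms Tess4j run; the class and method names are made up.

```java
public class SpeedupArithmetic {

    /** Ratio of two timings; 2.0 means the first took twice as long. */
    public static double ratio(double slowerMs, double fasterMs) {
        return slowerMs / fasterMs;
    }

    /** Time left after removing the given fraction from a timing. */
    public static double afterImprovement(double ms, double fraction) {
        return ms * (1.0 - fraction);
    }

    public static void main(String[] args) {
        double tess4jMs = 11010; // Tess4j timing from the output above
        double cliMs = 6589;     // direct tesseract.exe timing from the output above

        // Assumption: the 5% improvement applies as-is to the Tess4j timing.
        double improved = afterImprovement(tess4jMs, 0.05);

        System.out.println("Gap (ms): " + (tess4jMs - cliMs));
        System.out.println("Ratio before: " + ratio(tess4jMs, cliMs));
        System.out.println("Tess4j after 5% (ms): " + improved);
        System.out.println("Ratio after: " + ratio(improved, cliMs));
    }
}
```

Under that assumption the gap of 4421 ms shrinks by only about 550 ms, so the Tess4j run would still take roughly 1.6 times as long as the direct call.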

Could you run one test calling Tesseract through Tess4j and another calling Tesseract directly?

cesc6 commented 2 years ago

Hi @nguyenq, have you been able to look into this?

Thanks in advance.

nguyenq commented 2 years ago

I noticed during testing that my numbers using the Tesseract executable binary were significantly slower than yours. Evidently, the builds were not optimized. I suggest you perform a custom build following the official build instructions and test with that.

https://tesseract-ocr.github.io/tessdoc/#compiling-and-installation

Please keep us posted on your results. Thanks.

cesc6 commented 2 years ago

Thanks @nguyenq. I'm sorry to insist, but I don't understand why the problem would be in the Tesseract compilation if the latency appears when I use the Tess4j lib. Even if the Tesseract builds are not optimized, don't you think the times should be similar for the two calls?
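One way to compare the two calls more evenly is to time each of them several times and report the first run separately from the later ones, since a first call may include one-time costs such as loading a native library. This is only a sketch of that idea, not something either side of this thread actually ran. The class and method names are made up, and the workload in main is a placeholder to be replaced with the doOCR call or the external process.

```java
import java.util.Arrays;

public class TimingHarness {

    /** First (cold) run and the median of the runs that follow it. */
    public static final class Result {
        public final long firstRunMs;
        public final long medianLaterMs;

        Result(long firstRunMs, long medianLaterMs) {
            this.firstRunMs = firstRunMs;
            this.medianLaterMs = medianLaterMs;
        }
    }

    /** Runs the task once, then laterRuns more times, timing each run. */
    public static Result measure(Runnable task, int laterRuns) {
        if (laterRuns < 1) {
            throw new IllegalArgumentException("laterRuns must be >= 1");
        }
        long first = timeOnceMs(task);
        long[] later = new long[laterRuns];
        for (int i = 0; i < laterRuns; i++) {
            later[i] = timeOnceMs(task);
        }
        Arrays.sort(later);
        // Upper median when the number of later runs is even.
        return new Result(first, later[laterRuns / 2]);
    }

    private static long timeOnceMs(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the call being measured.
        Runnable task = () -> {
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Result r = measure(task, 5);
        System.out.println("First run: " + r.firstRunMs + " ms");
        System.out.println("Median of later runs: " + r.medianLaterMs + " ms");
    }
}
```

A call that throws a checked exception would need to be wrapped before it can be passed in as a Runnable.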

nguyenq commented 2 years ago

I ran the Process proc = run.exec(command); snippet on the .exe version I built, and it was significantly slower than your version. That's why I suggest you try the build process they use at https://github.com/UB-Mannheim/tesseract/wiki.

cesc6 commented 2 years ago

I built an optimized version for Linux and Windows systems and the results were the same. What do you get if you run my code?

public static void main(String[] args) throws Exception {
        long t0, tf;

        File scannedPdf = new File("C:\\Users\\francesc.sola\\Desktop\\work_tesseract\\1_page.jpg");
        ITesseract instance = new Tesseract();  // JNA Interface Mapping
        //ITesseract instance = new Tesseract1(); // JNA Direct Mapping

        System.out.println("Using Tess4j-5.1.1");
        t0 = System.currentTimeMillis();
        instance.doOCR(scannedPdf);
        tf = System.currentTimeMillis();
        System.out.println("Process time: " + (tf - t0) + " ms.");

        System.out.println("Direct call to tesseract v5.0.0-alpha.20210811");
        String command = "tesseract.exe C:\\Users\\francesc.sola\\Desktop\\work_tesseract\\1_page.jpg C:\\Users\\francesc.sola\\Desktop\\work_tesseract\\out";

        // Running the above command
        Runtime run = Runtime.getRuntime();
        t0 = System.currentTimeMillis();
        Process proc = run.exec(command);
        proc.waitFor();
        tf = System.currentTimeMillis();

        System.out.println("Process time: " + (tf - t0) + " ms.");
        run.exit(0);
    }

Can you show your output?

nguyenq commented 2 years ago

This is my output on a Windows system with a Ryzen 7 5800X and 32GB RAM.

Using Tess4j-5.1.2
Process time: 2775 ms.
Direct call to tesseract v5.1.0
Process time: 4162 ms.

nguyenq commented 2 years ago

Running with tesseract.exe from https://github.com/UB-Mannheim/tesseract/wiki:

Direct call to tesseract v5.0.1
Process time: 1989 ms.

cesc6 commented 2 years ago

These are good performance results! Many thanks for your tests. I will try to make some changes and run tests based on what you said.