nguyenq / tess4j

Java JNA wrapper for Tesseract OCR API
Apache License 2.0

Tesseract CLI gives better results than Tess4J #264

Closed JoachimUnger closed 5 months ago

JoachimUnger commented 5 months ago

I am using Tesseract 5.3.4

tesseract v5.3.4.20240503
 leptonica-1.84.1
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 3.0.1) : libpng 1.6.43 : libtiff 4.6.0 : zlib 1.3 : libwebp 1.4.0 : libopenjp2 2.5.2
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.7.4 zlib/1.3.1 liblzma/5.6.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.6
 Found libcurl/8.7.1 Schannel zlib/1.3.1 brotli/1.1.0 zstd/1.5.6 libidn2/2.3.7 libpsl/0.21.5 libssh2/1.11.0

The tessdata folder contains the "best" deu.traineddata.

"c:\Program Files\Tesseract-OCR\tesseract" test2.png output2 -l deu

results in output 'ZAUN'.

package net.sourceforge.tess4j.example;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

import java.io.File;

public class TesseractExample {

    public static void main(String[] args) {
        File imageFile = new File("K:/IdeaWorkspace/Tess4J/test2.png");
        ITesseract instance = new Tesseract();  // JNA Interface Mapping
        instance.setLanguage("deu");

        try {
            instance.setDatapath("C:\\Program Files\\Tesseract-OCR\\tessdata");
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}

The result is 'ZALUM'.

Does tesseract.exe do additional processing? Or are there different internal settings?

[Attached image: test2]

nguyenq commented 5 months ago

Duplicate of https://github.com/nguyenq/tess4j/issues/261 and https://github.com/nguyenq/tess4j/issues/259

JoachimUnger commented 5 months ago

With the latest version of VietOCR (6.13.1) and the best traineddata, I get 'ZALUM' with OcrEngineMode=1 and PageSegMode=7. The default PageSegMode 3 gives no result.

nguyenq commented 5 months ago

[Attached image]

JoachimUnger commented 5 months ago

The important difference is that my PNG has 8 bpp and yours has 24 bpp.

So it works!

[Attached image]
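The bit-depth difference mentioned above can be checked from Java before running OCR. The following is a minimal sketch using only the JDK; the class name BitDepthCheck and the helper bitsPerPixel are hypothetical names chosen for illustration and are not part of Tess4J.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper for illustration; not part of Tess4J.
final class BitDepthCheck {

    /** Returns the number of bits per pixel reported by the image's color model. */
    static int bitsPerPixel(BufferedImage image) {
        return image.getColorModel().getPixelSize();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("Usage: java BitDepthCheck <image file>");
            return;
        }
        // ImageIO.read returns null when no registered reader understands the file.
        BufferedImage image = ImageIO.read(new File(args[0]));
        if (image == null) {
            System.err.println("Unsupported or unreadable image: " + args[0]);
            return;
        }
        System.out.println(bitsPerPixel(image) + " bpp");
    }
}
```

Run against both PNG files, this should make it easy to see whether two images that look the same actually differ in pixel format.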

nguyenq commented 5 months ago

@JoachimUnger Acknowledged. I right-clicked on the attached image, copied and pasted it into the VietOCR UI, and got the results.

However, if I saved the image to the local drive, loaded it in the program, and performed OCR on it, it would produce blank output. When I applied either the grayscale or the monochrome filter, I got good output again.

It's possible, even likely, that the Tesseract CLI performs some basic image preprocessing before the OCR stage. You may have to perform similar preprocessing yourself when using tess4j.
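As a rough sketch of what such preprocessing could look like on the Java side, the JDK-only helper below redraws an image as 8-bit grayscale or as 24-bit RGB. The class and method names (OcrPreprocess, toGrayscale, toRgb24) are hypothetical, and this is not necessarily what VietOCR's grayscale filter does internally. The white background fill is an added assumption so that transparent areas do not turn black.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration; not part of Tess4J or VietOCR.
final class OcrPreprocess {

    /** Redraws the source image into a new image of the given BufferedImage type. */
    private static BufferedImage redraw(BufferedImage source, int targetType) {
        BufferedImage target =
                new BufferedImage(source.getWidth(), source.getHeight(), targetType);
        Graphics2D g = target.createGraphics();
        try {
            // Assumption: paint white first so transparent pixels do not become black.
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, target.getWidth(), target.getHeight());
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return target;
    }

    /** Converts any image (for example an 8 bpp palette PNG) to 8-bit grayscale. */
    static BufferedImage toGrayscale(BufferedImage source) {
        return redraw(source, BufferedImage.TYPE_BYTE_GRAY);
    }

    /** Converts any image to 24-bit RGB without an alpha channel. */
    static BufferedImage toRgb24(BufferedImage source) {
        return redraw(source, BufferedImage.TYPE_INT_RGB);
    }

    // Intended use with Tess4J (assumes the Tess4J jar is on the classpath):
    //   BufferedImage input = ImageIO.read(imageFile);
    //   String text = instance.doOCR(OcrPreprocess.toGrayscale(input));
}
```

Passing the converted image to doOCR(BufferedImage) instead of the original file would show whether the pixel format alone accounts for the different output.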

nguyenq commented 5 months ago

We may need to debug and trace through the native code to determine what preprocessing is performed for this kind of image.

nguyenq commented 5 months ago

@JoachimUnger The Tesseract OCR engine did not perform any preprocessing on this image. The CLI creates its output text file with TextRenderer, not with GetUTF8Text, which is what doOCR calls. If you use the renderer in your program, you should get matching results. You can verify this with VietOCR's Bulk OCR function, which uses the renderers.