nguyenq / tess4j

Java JNA wrapper for Tesseract OCR API
Apache License 2.0

Tesseract CLI gives different results (bounding boxes, text) than Tess4J when creating searchable PDF #266

Closed: goldfish578hoodlum closed this issue 2 months ago

goldfish578hoodlum commented 3 months ago

The searchable PDF output of the Tesseract-OCR 5.3.4 CLI differs from that of tess4j-5.11.0.

Searchable PDF created with Tesseract-OCR CLI:

tesseract BostonPermit.tif BostonPermit-tesseract-5.3.4-cli -v --tessdata-dir "C:\Program Files\Tesseract-OCR\tessdata" -c tessedit_create_pdf=1

tesseract v5.3.4.20240503
 leptonica-1.84.1
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 3.0.1) : libpng 1.6.43 : libtiff 4.6.0 : zlib 1.3 : libwebp 1.4.0 : libopenjp2 2.5.2
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.7.4 zlib/1.3.1 liblzma/5.6.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.6
 Found libcurl/8.7.1 Schannel zlib/1.3.1 brotli/1.1.0 zstd/1.5.6 libidn2/2.3.7 libpsl/0.21.5 libssh2/1.11.0
Page 1
Page 2
Page 3
Page 4
Page 5

Searchable PDF created with Tess4j-5.11.0:

package net.sourceforge.tess4j.example;

import java.util.Arrays;

import net.sourceforge.tess4j.ITessAPI.TessPageSegMode;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.ITesseract.RenderedFormat;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class CreateSearchablePdfExample {

    public static void main(String[] args) {
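        ITesseract itess = new Tesseract();
        // Use the same tessdata directory as the CLI run above.
        itess.setDatapath("C:\\Program Files\\Tesseract-OCR\\tessdata");
        // The CLI command passes no -l or --psm option, so set its defaults explicitly:
        // language "eng" and fully automatic page segmentation (--psm 3).
        itess.setLanguage("eng");
        itess.setPageSegMode(TessPageSegMode.PSM_AUTO);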
        try {
            itess.createDocuments(
                "BostonPermit.tif", "BostonPermit-tess4j-5.11.0",
                Arrays.asList(RenderedFormat.PDF));
        }
        catch (TesseractException ex) {
            ex.printStackTrace();
        }
    }
}

materials.zip

Opening both searchable PDFs in Acrobat and searching for the term "permit" shows that the bounding box in the Tesseract-OCR CLI output surrounds all pixels of the word, whereas the box in the tess4j output excludes the trailing letter 't'.

[Screenshots: Tesseract-OCR 5.3.4 | tess4j-5.11.0]
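To put a number on the difference rather than eyeballing it in Acrobat, the two word boxes could be compared directly. The following is only a minimal sketch: `WordBoxDiff` and its methods are hypothetical helpers (not part of Tess4J or Tesseract), and the coordinates in `main` are made up for illustration, not measured from BostonPermit.tif.

```java
import java.awt.Rectangle;

// Hypothetical helper for comparing two word bounding boxes.
// Not part of Tess4J or Tesseract.
final class WordBoxDiff {

    /**
     * Returns how many pixels the candidate box falls short of the
     * reference box on the right edge (negative if it extends past it).
     */
    static int rightEdgeShortfall(Rectangle reference, Rectangle candidate) {
        int referenceRight = reference.x + reference.width;
        int candidateRight = candidate.x + candidate.width;
        return referenceRight - candidateRight;
    }

    /** Returns true if the candidate box fully encloses the reference box. */
    static boolean covers(Rectangle candidate, Rectangle reference) {
        return candidate.contains(reference);
    }

    public static void main(String[] args) {
        // Illustrative coordinates only (assumption), in image pixels.
        Rectangle cliBox = new Rectangle(100, 200, 90, 30);
        Rectangle tess4jBox = new Rectangle(100, 200, 78, 30);

        System.out.println("Right-edge shortfall (px): "
                + rightEdgeShortfall(cliBox, tess4jBox));
        System.out.println("tess4j box covers CLI box: "
                + covers(tess4jBox, cliBox));
    }
}
```

A positive shortfall together with `covers` returning false would match what the search highlight shows: the narrower box stops before the end of the word.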

Are you able to reproduce these results?

nguyenq commented 2 months ago

In my investigation, the text looks correct, but not the bounding boxes, which appear a few pixels too narrow.

My testing with the .NET version indicates that the problem has been fixed in Tesseract-OCR 5.4.0. I've been trying, without success, to build an updated DLL that works with Java without invalid memory access exceptions. Recent VS2022 updates might have broken the builds.
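Until an updated DLL is available, a caller could at least detect whether the Tesseract build in use predates 5.4.0. The sketch below is an assumption-laden illustration: `TesseractVersionCheck` is a hypothetical helper, the version string is assumed to look like the `tesseract -v` output quoted earlier (for example `v5.3.4.20240503`), and how that string is obtained from the native library is left out.

```java
// Hypothetical helper; not part of Tess4J or Tesseract.
final class TesseractVersionCheck {

    /** Parses the leading major.minor.patch from strings such as "5.3.4" or "v5.3.4.20240503". */
    static int[] parse(String version) {
        String v = version.startsWith("v") ? version.substring(1) : version;
        String[] parts = v.split("\\.");
        int[] numbers = new int[3];
        for (int i = 0; i < 3 && i < parts.length; i++) {
            // Drop any non-numeric suffix such as "-rc1"; treat an empty part as 0.
            String digits = parts[i].replaceAll("\\D.*$", "");
            numbers[i] = digits.isEmpty() ? 0 : Integer.parseInt(digits);
        }
        return numbers;
    }

    /** Returns true if the version is greater than or equal to major.minor.patch. */
    static boolean atLeast(String version, int major, int minor, int patch) {
        int[] actual = parse(version);
        int[] required = {major, minor, patch};
        for (int i = 0; i < 3; i++) {
            if (actual[i] != required[i]) {
                return actual[i] > required[i];
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 5.4.0 is the release in which the fix was observed in this thread.
        String reported = "v5.3.4.20240503";
        if (!atLeast(reported, 5, 4, 0)) {
            System.out.println(reported
                    + " predates 5.4.0; PDF bounding boxes may be too narrow.");
        }
    }
}
```

The components are compared numerically rather than as strings, so a version such as 5.10.0 is correctly treated as newer than 5.4.0.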

nguyenq commented 2 months ago

Fixed by commit bae35f5045e399c344b986da5835e4db3448eb5d