Tess4j does not correctly handle images with alpha channel

waljohn commented 1 year ago

If an image has an alpha channel (regardless of whether the image has any actually transparent pixels), the OCR output is empty.

I had one TIFF image that wouldn't OCR, and it took me quite a long time of trial and error to figure out why this one file failed while other, seemingly identical ones worked. When calling tesseract(.exe) directly, the image is OCR'ed correctly.

Tess4J should either throw an exception or do the OCR.

Minimum SSCCE

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;

import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import java.awt.*;
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;

public class Tess4JTransparentTiffErrPOC {

    private static final String PATH_TO_TESS_DATA = "c:\\opt\\lib\\tessdata-4.1.0";
    private static final Path transparencyTiff = Path.of(Objects.requireNonNull(System.getenv("USERPROFILE"))).resolve("Desktop")
            .resolve("wikipedia_bio_margret_ives_abbot.tiff");

    public static void main(String[] args)  {

        ITesseract instance = new Tesseract();
        instance.setDatapath(PATH_TO_TESS_DATA);

        try {
            String result = instance.doOCR(transparencyTiff.toFile());
            System.out.printf("Unmodified input OCR is length %d:%n", result.length());
            System.out.println(result);

            //now flatten the image
            BufferedImage img = flattenImage(transparencyTiff);

            result = instance.doOCR(img);
            System.out.printf("Flattened input OCR is length %d:%n", result.length());
            System.out.println(result);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /** @noinspection SameParameterValue*/
    private static BufferedImage flattenImage(Path path) {
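        // Requires a TIFF ImageIO plugin (bundled with Java 9+).
        final ImageReader imageReader = ImageIO.getImageReadersBySuffix("tiff").next();
        // Declare both streams so each is closed; closing the ImageInputStream alone leaves the file stream open.
        try (var in = Files.newInputStream(path);
             ImageInputStream is = ImageIO.createImageInputStream(in)) {
            imageReader.setInput(is);
            final BufferedImage pageImage = imageReader.read(0);
            // TYPE_INT_RGB has no alpha channel; paint white first so transparent pixels end up white.
            final BufferedImage flattened = new BufferedImage(pageImage.getWidth(), pageImage.getHeight(), BufferedImage.TYPE_INT_RGB);
            Graphics2D graphics = flattened.createGraphics();
            graphics.setColor(Color.WHITE);
            graphics.fillRect(0, 0, flattened.getWidth(), flattened.getHeight());
            graphics.drawImage(pageImage, 0, 0, null);
            graphics.dispose();
            return flattened;
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        } finally {
            // Release the reader's native resources.
            imageReader.dispose();
        }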
    }
}

wikipedia_bio_margret_ives_abbot.tiff.gz

nguyenq commented 1 year ago

You'd need to preprocess such images. We applied the monochrome filter in VietOCR, which uses Tess4J, and were able to get the OCR text.

waljohn commented 1 year ago

In the example, flattenImage basically "preprocesses" the image.

Without performing that operation, VietOCR also fails to produce any OCR text.

So I believe VietOCR is exhibiting the same bug.

nguyenq commented 1 year ago

Since Tesseract probably performs this preprocessing step when reading an image, you'll need to do the same in your Java code: the Tess4J wrapper does not include any image preprocessing; it only reads the image data and sends it to the engine.
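
For reference, that kind of preprocessing could be sketched as below, using only the JDK. `AlphaFlatten`, `hasAlpha` and `flattenOnWhite` are hypothetical names for illustration and are not part of Tess4J. The sketch assumes that compositing onto a white background is acceptable for the document, as in the `flattenImage` method of the SSCCE above. It checks the color model rather than the pixels, so it also catches images that carry an alpha channel but are fully opaque.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

/** Hypothetical helper (not part of Tess4J): removes the alpha channel before OCR. */
final class AlphaFlatten {

    private AlphaFlatten() {}

    /** True if the color model carries an alpha channel, even when every pixel is opaque. */
    static boolean hasAlpha(BufferedImage image) {
        return image.getColorModel().hasAlpha();
    }

    /**
     * Returns the image unchanged if it has no alpha channel;
     * otherwise returns a TYPE_INT_RGB copy composited onto white.
     */
    static BufferedImage flattenOnWhite(BufferedImage image) {
        if (!hasAlpha(image)) {
            return image;
        }
        BufferedImage flattened = new BufferedImage(
                image.getWidth(), image.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = flattened.createGraphics();
        try {
            // Paint the background first so transparent areas become white, not black.
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, flattened.getWidth(), flattened.getHeight());
            g.drawImage(image, 0, 0, null);
        } finally {
            g.dispose();
        }
        return flattened;
    }
}
```

The flattened result can then be passed to `doOCR(BufferedImage)`, the same overload the SSCCE calls on its flattened image.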

nguyenq commented 1 year ago

https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#transparency--alpha-channel

waljohn commented 1 year ago

That's for version 3.

This is version 5.

https://github.com/nguyenq/tess4j/tree/master/src/main/resources/win32-x86-64

nguyenq commented 1 year ago

@waljohn Please submit a PR.

nguyenq commented 3 months ago

We may need to debug and trace through the native code to determine what preprocessing is performed for this kind of image.

nguyenq commented 3 months ago

@waljohn We found out that this issue is identical to https://github.com/nguyenq/tess4j/issues/264.

The Tesseract OCR engine did not perform any special preprocessing on this image. The CLI uses TextRenderer to create the output text file, not GetUTF8Text, which is what doOCR calls. If you use the renderer in your program, you'll get results that match the CLI. You can verify this with VietOCR's Bulk OCR function, which uses the renderers.

nguyenq / tess4j

Tess4j does not correctly handle images with alpha channel #249