quickemu-project / quicktest

Quickly and automatically test systems inside Quickemu virtual machines 🧑‍🔬
MIT License

feat: Improved tesseract OCR performance #5

Open popey opened 1 month ago

popey commented 1 month ago

What would you like to be added:

Improved text recognition within screenshots.

Why is this needed:

Tesseract is pretty great, but sometimes doesn't recognise text on screenshots. We already scale the screenshot up 3x before running tesseract-ocr on it. That improved text recognition tremendously. But I think there's more we could do.

Additional context:

It's possible to train tesseract to create our own dataset. Is that worthwhile? Is it worth training tesseract on the Ubuntu font for example?

ali1234 commented 1 month ago

Recognizing text in screenshots is, I think, a fundamentally different problem to OCR, unless the text is huge. Unfortunately, in screenshots it usually isn't (#29).

Tesseract also seems very sensitive to small changes in the input (#19, #30). This is likely a result of the way it performs binarization. Doing this in a smarter way, one optimized for screenshots rather than scans and photographs, would probably give bigger improvements than training on the Ubuntu font.

popey commented 1 month ago

I agree, it's not optimal.

So far, though, the main issue seems to be text with a drop-shadow on a coloured background. Tesseract has better success with title bars and text within windows, where the background is a uniform colour.

As a test, I thought I'd try a few things, including the following, to see if tesseract had a better time.

Here's a script I threw together to test on the problematic "Install Xubuntu" example @ali1234 gave in #29:

```bash
#!/bin/bash
# Usage: ./ocr-test.sh <image> "<text to search for>"
# Nested loops iterate over scale factors, tile positions and the
# various tesseract ocr options, running tesseract on each tile in
# turn and producing a text file with the output of each run. We
# then grep each resulting text file for the given string.

# The first parameter is the image file to be processed
image=$1
imagename=$(basename "$image")

# The second parameter is the text string to search for
text=$2

# The number of tiles to split the image into, both horizontally and vertically
tiles=4

# Datestamped temporary directory to store the output of the tesseract runs
workdir=$(date +%Y%m%d-%H%M%S)

mkdir "$workdir"

for x in $(seq 0 $((tiles-1))); do
    for y in $(seq 0 $((tiles-1))); do
        for scale in 100 200 300 400 500; do
            scaledimage="$workdir"/scaled-"$scale"-"$imagename"
            convert "$image" -resize "$scale"% "$scaledimage"
            width=$(identify -format "%w" "$scaledimage")
            height=$(identify -format "%h" "$scaledimage")
            tilewidth=$((width/tiles))
            tileheight=$((height/tiles))
            xoffset=$((x*tilewidth))
            yoffset=$((y*tileheight))
            tileimage="$workdir"/tile-"$x"-"$y"-"$scale"-"$imagename"
            convert "$scaledimage" -crop "$tilewidth"x"$tileheight"+"$xoffset"+"$yoffset" "$tileimage"
            for psm in 4 5 6 7 8 9 10 11 12 13; do
                for oem in 3; do
                    echo "Running tesseract with scale $scale%, psm $psm, oem $oem on tile $x $y"
                    outbase="$workdir"/out-tile-"$x"-"$y"-"$scale"-"$psm"-"$oem"
                    if ! tesseract --loglevel ALL -c tessedit_write_images=true --psm "$psm" --oem "$oem" "$tileimage" "$outbase"; then
                        echo "tesseract failed"
                        exit 1
                    fi
                    if grep -q "$text" "$outbase".txt; then
                        echo "Found $text in $outbase.txt"
                    fi
                done
            done
        done
    done
done
```

What seemed to work best was chopping the image up. Now, this won't work every time, especially if the text we're looking for is cut in half horizontally or vertically at a split boundary.

```
grep Xubuntu 20240519-143916/*.txt
20240519-143916/out-tile-0-1-100-11-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-100-12-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-100-4-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-100-6-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-200-11-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-200-12-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-200-4-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-200-6-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-300-11-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-300-12-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-300-4-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-300-6-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-400-11-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-400-12-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-400-4-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-400-6-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-500-11-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-500-12-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-500-4-3.txt:Install Xubuntu
20240519-143916/out-tile-0-1-500-6-3.txt:Install Xubuntu
```

Scaling the image wasn't needed at all. Just cropping away all the other stuff and letting tesseract focus on one thing was enough. This is the successful crop it used (tile 0,1: first tile across, second tile down, of four):

(Note I had to convert from ppm to png before uploading, so some compression may have occurred which changes this image)

tile-0-1-100-screenshot_0012_test_installer_initial_load

It might be valuable to allow tesseract a few goes at each screenshot, and only as an option where we know things are problematic (poorly dithered wallpapers, bad anti-aliasing, drop-shadows and the like). I imagine that once we get past this first screen, most of the rest of the test would run just fine, unless Xubuntu has some wacky installer with plasma backgrounds :D
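One hypothetical mitigation for the "text cut in half at a tile boundary" problem (not something the script above does) is overlapping tiles: stepping by half a tile width means any small text feature lies well inside at least one tile. A minimal sketch of the offset calculation along one axis, with lengths in pixels:

```python
# Hypothetical helper: compute tile start offsets along one axis so
# that consecutive tiles overlap by the given fraction. Any text
# feature narrower than (overlap * tile) is then very likely to fall
# entirely inside at least one tile.
def tile_offsets(length, tile, overlap=0.5):
    step = max(1, int(tile * (1 - overlap)))
    offsets = list(range(0, max(length - tile, 0) + 1, step))
    # Make sure the final tile reaches the end of the axis
    if length > tile and offsets[-1] != length - tile:
        offsets.append(length - tile)
    return offsets

print(tile_offsets(400, 100))  # [0, 50, 100, 150, 200, 250, 300]
```

The cost is roughly doubling the number of tesseract runs per axis, which may be acceptable given it's only needed on problematic screens.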

ali1234 commented 1 month ago

What about independently analysing each channel? E.g. picking only the red channel might help for Xubuntu; it would give this:

image

I assume this would have to be done outside tesseract but this is the kind of thing I meant about "smarter binarization".

ali1234 commented 1 month ago

The problem as I see it is that tesseract's binarization is too smart. It is designed for scans, which often have the problem that brightness varies over the image: the background in one part of the image can be darker than the text in another part, as in this example:

https://tesseract-ocr.github.io/tessdoc/images/binarisation.png

So it has to use some kind of adaptive thresholding, which introduces fringing and therefore only works well at high resolution.

For screenshots we can use simpler methods - take a single channel, or take an absolute threshold across the entire image, because it is unlikely the text we are looking for is going to vary all that much, if at all. The background might vary but it will never be simultaneously brighter and darker than the text.
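The single-channel, absolute-threshold idea can be sketched in a few lines of plain Python. This is a toy illustration on raw RGB tuples; in practice you'd do the same thing with imagemagick or PIL before handing the image to tesseract:

```python
# Toy sketch of "dumb" binarization for screenshots: pick one channel
# and apply a single absolute threshold across the whole image, rather
# than tesseract's adaptive thresholding. pixels is a list of rows of
# (r, g, b) tuples with 0-255 values; channel 0 is red.
def binarize(pixels, channel=0, threshold=128):
    return [
        [255 if px[channel] > threshold else 0 for px in row]
        for row in pixels
    ]

# Dark text (low red) on a brighter background (high red):
row = [(200, 60, 60), (30, 30, 30), (200, 60, 60)]
print(binarize([row]))  # [[255, 0, 255]]
```

Because the text colour is essentially constant across a screenshot, one global threshold cleanly separates it from the background with no fringing at all.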

ali1234 commented 1 month ago

The other problem is that tesseract expects the image to be mostly filled with text, rather than mostly empty space with just a few words scattered around. This is likely why dividing up the image helps. This could be automated by looking at the image gradient to find regions that are not "flat" empty space, then partitioning into rectangles of roughly equal gradient.
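A rough sketch of that automation, assuming a greyscale image as nested lists: call a grid tile "flat" if its pixel values barely vary, and keep only the tiles with content. (A real implementation would use a proper gradient such as Sobel, and merge adjacent busy tiles into rectangles.)

```python
# Hypothetical sketch: flag grid tiles that are not "flat" empty
# space, so only those need to be handed to tesseract. img is a
# greyscale image as a list of rows of 0-255 values.
def is_flat(tile, tol=8):
    vals = [v for row in tile for v in row]
    return max(vals) - min(vals) <= tol

def busy_tiles(img, size):
    busy = []
    for y in range(0, len(img), size):
        for x in range(0, len(img[0]), size):
            tile = [row[x:x + size] for row in img[y:y + size]]
            if not is_flat(tile):
                busy.append((x, y))
    return busy

# A 4x4 image that is empty except for one bright pixel:
img = [[0] * 4 for _ in range(4)]
img[2][2] = 200
print(busy_tiles(img, 2))  # [(2, 2)]
```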

ali1234 commented 1 month ago

I had the idea to try ESRGAN trained for text and the result looks really good. I would expect this to perform much better than simple image scaling. Downside is you need CUDA to run it in a reasonable time.

image

ali1234 commented 1 month ago

For reference, the 8x ESRGAN upscale takes about 2 minutes on an i7-6700 or about 4 seconds on an RTX 2070.

ali1234 commented 1 month ago

This improved the output for #30 a bit. The clock was recognized, and the words are not run together. Unfortunately it did not help for #29 at all. There is just too little text on the page for tesseract to figure out what is going on.

Next thing to try is manual text detection.

bloominstrong commented 1 week ago

For reference, this is what other projects have done.

The NixOS testing suite has a similar function for testing graphical applications. They use imagemagick to convert the image to a TIFF, plus a few other transformations I'm not familiar with, and there is a second option to return different interpretations of the text.

I don't have any first-hand experience with how well it works. With a little searching I did see one comment about it being unreliable for small fonts, but could not find any other complaints about its performance.