tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Simple monospace text not correctly interpreted #2820

Open ahri opened 4 years ago

ahri commented 4 years ago

Environment

Current Behavior:

Config file:

tessedit_char_whitelist 0123456789abcdef

Image:

avatar

Command-line usage:

$ tesseract avatar.png - --psm 7 --oem 1 tess_config
eea6d7bbe81

Expected Behavior:

eea6d7bbe581

i.e.

eea6d7bbe81 (incorrect)
vs.
eea6d7bbe581 (correct)

To elaborate; I would like to detect only a single hex number rendered in monospace in a PNG.

ahri commented 4 years ago

It looks like it's detecting the text as eea6d7bbeS81 and then stripping the S as it doesn't match the whitelist. I expected whitelisting to constrain the problem space and therefore make it easier for Tesseract to read, but this is clearly not the case.
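One workaround consistent with this observation is to run Tesseract without the whitelist and repair look-alike glyphs afterwards. The sketch below is a heuristic and not a Tesseract feature: `normalize_hex` is a hypothetical helper, and its mapping (S to 5, O to 0, I and l to 1) is an assumption about which confusions occur with this font.

```bash
#!/bin/bash
# Hypothetical post-processing helper (not part of Tesseract).
# Step 1: map assumed look-alike glyphs onto hex digits (S->5, O->0, I->1, l->1).
# Step 2: drop every character that is still not a lowercase hex digit.
normalize_hex() {
    printf '%s' "$1" | tr 'SOIl' '5011' | tr -cd '0123456789abcdef'
}

# Example with the string Tesseract appears to read before whitelisting:
normalize_hex 'eea6d7bbeS81'   # prints eea6d7bbe581
echo
```

The raw input could come from the command above with the config file omitted, i.e. `tesseract avatar.png - --psm 7 --oem 1`. The mapping only makes sense because the expected output is known to be lowercase hex; it would corrupt general text.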

woodjohndavid commented 4 years ago

I am also trying to use Tesseract to OCR random strings of mixed letters and numbers, and I have the same general problem as you describe: Tesseract mixes up 'S' and '5', and also '1' and 'I'.

Tesseract is primarily designed to recognize words and determine what characters are present by what should be there for the word to be valid. So it doesn't naturally deal well with non-word strings.

The only suggestion I have is the following list of config file parameters that I am using to try to prevent Tesseract from using the word-matching method and instead just use a character by character recognition approach:

tessedit_flip_0O 0
load_system_dawg 0
load_freq_dawg 0
language_model_min_compound_length 1
language_model_penalty_increment 0.0
language_model_penalty_punc 0.0
language_model_penalty_spacing 0.0
language_model_penalty_script 0.0
language_model_penalty_non_dict_word 0.0
language_model_penalty_chartype 0.0
language_model_penalty_case 0.0
language_model_penalty_non_freq_dict_word 0.0

To be honest, I don't even know if this makes any difference, or whether the LSTM engine (which I am using) pays attention to these settings.

ryanleonbutler commented 4 years ago

Hi Tesseract-ocr Team,

I am facing similar challenges as @woodjohndavid.

I am trying to recognise and extract a random 50+ character string (UID) from images that are uploaded in my workflow. They need to be 100% correct in order to find the correct UID. In my current OCR results, I get random spaces and incorrectly recognised characters, as per @woodjohndavid's explanation. Below is my example baseline image, chosen to get the best results:

baseline_test

My quick and dirty bash test (FYI: building a web app with Python that will be the finished product):

./bash_test.sh
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 373
OSD: Weak margin (0.83) for 52 blob text block, but using orientation anyway: 0
--------
Wrong!!!
--------
RFx1BGE6te9IoPS3Sx5GM-9WUBTwrVSzCR1IJzStRqhBwjvsZm25Kw== (Correct)
RFx1BGE6te9ToPS3Sx5GM—9WUBTwrVSzZCRIIJzStRqhBwj vsZm25Kw== (Result)

My bash script for testing:

#!/bin/bash
# Bash script to test tesseract-ocr

tesseract /path/to/image/baseline_test.png output --psm 1
# I have tried all psm options. 13 is actually the best with this type of single line image
text=$(cat output.txt)

# Remove spaces
# text=${text//[[:blank:]]/}

if [ "$text" == "RFx1BGE6te9IoPS3Sx5GM-9WUBTwrVSzCR1IJzStRqhBwjvsZm25Kw==" ];
then
    echo "########"
    echo "Correct."
    echo "########"
else
    echo "--------"
    echo "Wrong!!!"
    echo "--------"
    echo "RFx1BGE6te9IoPS3Sx5GM-9WUBTwrVSzCR1IJzStRqhBwjvsZm25Kw== (Correct)"
    echo "${text} (Result)"
    echo "========"
fi

Do you have any thoughts or suggestions on this issue? Is it possible for Tesseract to recognise these long UIDs?

Looking forward to your response, thank you.
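To see exactly where a result diverges from the expected string, the comparison in the script above can be extended with two small helpers. This is a sketch only: `clean_ocr` and `first_mismatch` are hypothetical names, and the cleanup (dropping blanks, turning an em dash back into a hyphen) is assumed to be safe only because a UID of this kind contains no spaces or em dashes.

```bash
#!/bin/bash
# Hypothetical helpers for inspecting an OCR result; not part of Tesseract.

# Remove spaces and tabs, and turn an em dash back into an ASCII hyphen.
clean_ocr() {
    printf '%s\n' "$1" | tr -d '[:blank:]' | sed 's/—/-/g'
}

# Print the 0-based index of the first differing character, or -1 if the
# two strings are identical.
first_mismatch() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = (length(a) > length(b)) ? length(a) : length(b)
        for (i = 1; i <= n; i++) {
            if (substr(a, i, 1) != substr(b, i, 1)) { print i - 1; exit }
        }
        print -1
    }'
}
```

For the result shown earlier, `clean_ocr` would repair the stray space and the dash, but the `I`/`T`, extra `Z`, and `1`/`I` errors remain; `first_mismatch` then points at the first of them.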

amitdo commented 4 years ago
tessedit_flip_0O 0
load_system_dawg 0
load_freq_dawg 0
language_model_min_compound_length 1
language_model_penalty_increment 0.0
language_model_penalty_punc 0.0
language_model_penalty_spacing 0.0
language_model_penalty_script 0.0
language_model_penalty_non_dict_word 0.0
language_model_penalty_chartype 0.0
language_model_penalty_case 0.0
language_model_penalty_non_freq_dict_word 0.0

To be honest, I don't even know if this makes any difference, or whether the LSTM engine (which I am using) pays attention to these settings.

tessedit_flip_0O and anything that starts with language_model_ are ignored by the LSTM engine.
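Following that, a trimmed config would drop the ignored entries and keep only the two dictionary switches from the list above. A minimal sketch, assuming a hypothetical file name `nodict_config` and assuming that `load_system_dawg` and `load_freq_dawg` are still honoured by the LSTM engine, which this thread does not confirm:

```bash
#!/bin/bash
# Hypothetical helper: write a config file containing only the settings from
# the list above that are NOT reported as ignored by the LSTM engine.
# Usage: write_nodict_config <path>
write_nodict_config() {
    cat > "$1" <<'EOF'
load_system_dawg 0
load_freq_dawg 0
EOF
}
```

After `write_nodict_config nodict_config`, the file would be passed as the trailing argument in the same way as `tess_config` in the original report, e.g. `tesseract avatar.png - --psm 7 --oem 1 nodict_config`.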

amitdo commented 4 years ago

@ahri,

about the whitelist issue.

https://github.com/tesseract-ocr/tesseract/issues/2760#issuecomment-560372382

bertsky commented 3 years ago

about the whitelist issue.

#2760 (comment)

The better reference would be here – the reason for the current behaviour of white/blacklisting – which is indeed of little practical use – is the narrowness of the default beam in the LSTM decoder. The lstm_choice_mode option (going deeper by creating different beams again and again) unfortunately does not help that. (It is only used for GetChoiceIterator, not to prevent null hypotheses when the user dict does not allow certain choices. Plus it only works for certain LSTM models.)