Open ahri opened 4 years ago
It looks like it's detecting the text as eea6d7bbeS81 and then stripping the S as it doesn't match the whitelist. I expected whitelisting to constrain the problem space and therefore make it easier for Tesseract to read, but this is clearly not the case.
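As a rough illustration of the behaviour described above, the result looks the same as if the raw recognition were filtered through the whitelist after the fact. This is only a sketch that mimics the observed effect, not how Tesseract works internally; the hex whitelist is an assumption based on the kind of text being read.

```shell
#!/bin/sh
# Mimic the observed effect: a character outside the whitelist is dropped
# rather than re-read as an allowed character.
whitelist='0123456789abcdef'   # assumed hex whitelist
raw='eea6d7bbeS81'             # the text the engine appears to detect

# tr -cd deletes every character that is NOT in the given set.
filtered=$(printf '%s' "$raw" | tr -cd "$whitelist")
printf '%s\n' "$filtered"      # prints eea6d7bbe81 (one character short)
```

The output is one character shorter than the input, which matches the symptom reported here.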
I am also trying to use Tesseract to OCR random strings of letters and numbers mixed together, and I have the same general problem as you describe, with Tesseract mixing up 'S' and '5' and also '1' and 'I'.
Tesseract is primarily designed to recognize words and determine what characters are present by what should be there for the word to be valid. So it doesn't naturally deal well with non-word strings.
The only suggestion I have is the following list of config file parameters that I am using to try to prevent Tesseract from using the word-matching method and instead just use a character by character recognition approach:
tessedit_flip_0O 0
load_system_dawg 0
load_freq_dawg 0
language_model_min_compound_length 1
language_model_penalty_increment 0.0
language_model_penalty_punc 0.0
language_model_penalty_spacing 0.0
language_model_penalty_script 0.0
language_model_penalty_non_dict_word 0.0
language_model_penalty_chartype 0.0
language_model_penalty_case 0.0
language_model_penalty_non_freq_dict_word 0.0
To be honest, I don't even know if this makes any difference, or whether the LSTM engine (which I am using) pays attention to these settings.
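For anyone wanting to try these settings, here is a minimal sketch of one way to apply them: write the parameters to a config file and pass the file name as the last argument to tesseract. The config file location and the image path are placeholders of mine, not from this thread, and the tesseract call is skipped unless both the binary and the image exist.

```shell
#!/bin/sh
# Hypothetical location for the config file.
config_file="${TMPDIR:-/tmp}/no_dict_config"

# One "name value" pair per line, exactly as listed above.
cat > "$config_file" <<'EOF'
tessedit_flip_0O 0
load_system_dawg 0
load_freq_dawg 0
language_model_min_compound_length 1
language_model_penalty_increment 0.0
language_model_penalty_punc 0.0
language_model_penalty_spacing 0.0
language_model_penalty_script 0.0
language_model_penalty_non_dict_word 0.0
language_model_penalty_chartype 0.0
language_model_penalty_case 0.0
language_model_penalty_non_freq_dict_word 0.0
EOF

image="/path/to/image.png"   # placeholder path

# Config file names come last, after the image and the output base.
if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
    tesseract "$image" stdout "$config_file"
fi
```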
Hi Tesseract-ocr Team,
I am facing similar challenges as @woodjohndavid.
I am trying to recognise and extract a random 50+ character string (UID) from images that are uploaded in my workflow. The result needs to be 100% correct in order to find the correct UID. In my current OCR results, I get random spaces and incorrectly recognised characters, as per @woodjohndavid's explanation. Below is my example baseline image, prepared to give the best possible results:
My quick and dirty bash test (FYI: building a web app with Python that will be the finished product):
./bash_test.sh
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 373
OSD: Weak margin (0.83) for 52 blob text block, but using orientation anyway: 0
--------
Wrong!!!
--------
RFx1BGE6te9IoPS3Sx5GM-9WUBTwrVSzCR1IJzStRqhBwjvsZm25Kw== (Correct)
RFx1BGE6te9ToPS3Sx5GM—9WUBTwrVSzZCRIIJzStRqhBwj vsZm25Kw== (Result)
My bash script for testing:
#!/bin/bash
# Bash script to test tesseract-ocr
tesseract /path/to/image/baseline_test.png output --psm 1
# I have tried all psm options. 13 is actually the best with this type of single line image
text=$(cat output.txt)
# Remove spaces
# text=${text//[[:blank:]]/}
if [ "$text" == "RFx1BGE6te9IoPS3Sx5GM-9WUBTwrVSzCR1IJzStRqhBwjvsZm25Kw==" ]; then
    echo "########"
    echo "Correct."
    echo "########"
else
    echo "--------"
    echo "Wrong!!!"
    echo "--------"
    echo "RFx1BGE6te9IoPS3Sx5GM-9WUBTwrVSzCR1IJzStRqhBwjvsZm25Kw== (Correct)"
    # Keep the variable inside the quotes so the OCR text is not word-split
    echo "${text} (Result)"
    echo "========"
fi
Do you have any thoughts or suggestions on this issue? Is it possible for Tesseract-ocr to recognise these long UIDs?
Looking forward to your response, thank you.
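For reference, the commented-out "Remove spaces" step in the script above can be sketched as a small portable function and checked against the two strings from the test output. The function name is mine. It uses [:space:] rather than [:blank:] so that stray newlines are dropped as well.

```shell
#!/bin/sh
expected='RFx1BGE6te9IoPS3Sx5GM-9WUBTwrVSzCR1IJzStRqhBwjvsZm25Kw=='
result='RFx1BGE6te9ToPS3Sx5GM—9WUBTwrVSzZCRIIJzStRqhBwj vsZm25Kw=='

# Delete all whitespace from the string given as the first argument.
strip_blanks() {
    printf '%s' "$1" | tr -d '[:space:]'
}

cleaned=$(strip_blanks "$result")
if [ "$cleaned" = "$expected" ]; then
    echo "match after stripping whitespace"
else
    echo "still wrong after stripping whitespace"
fi
```

With the strings above this reports that the result is still wrong, since removing the stray space does nothing about the substituted and inserted characters.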
tessedit_flip_0O 0
load_system_dawg 0
load_freq_dawg 0
language_model_min_compound_length 1
language_model_penalty_increment 0.0
language_model_penalty_punc 0.0
language_model_penalty_spacing 0.0
language_model_penalty_script 0.0
language_model_penalty_non_dict_word 0.0
language_model_penalty_chartype 0.0
language_model_penalty_case 0.0
language_model_penalty_non_freq_dict_word 0.0
To be honest, I don't even know if this makes any difference, or whether the LSTM engine (which I am using) pays attention to these settings.
tessedit_flip_0O and anything that starts with language_model_ are ignored by the LSTM engine.
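Taking that statement at face value, the list above can be trimmed to the entries it does not rule out. The sketch below simply filters the quoted parameters by name. Whether the two remaining entries actually change LSTM results is not confirmed anywhere in this thread.

```shell
#!/bin/sh
params='tessedit_flip_0O 0
load_system_dawg 0
load_freq_dawg 0
language_model_min_compound_length 1
language_model_penalty_increment 0.0
language_model_penalty_punc 0.0
language_model_penalty_spacing 0.0
language_model_penalty_script 0.0
language_model_penalty_non_dict_word 0.0
language_model_penalty_chartype 0.0
language_model_penalty_case 0.0
language_model_penalty_non_freq_dict_word 0.0'

# Drop the entries said to be ignored by the LSTM engine.
lstm_params=$(printf '%s\n' "$params" |
    grep -v -e '^tessedit_flip_0O' -e '^language_model_')

printf '%s\n' "$lstm_params"
```

Only load_system_dawg and load_freq_dawg are left after filtering.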
@ahri, about the whitelist issue: https://github.com/tesseract-ocr/tesseract/issues/2760#issuecomment-560372382
The better reference would be here. The reason for the current behaviour of white/blacklisting (which is indeed of little practical use) is the narrowness of the default beam in the LSTM decoder. The lstm_choice_mode option (going deeper by creating different beams again and again) unfortunately does not help with that. (It is only used for GetChoiceIterator, not to prevent null hypotheses when the user dict does not allow certain choices. Plus it only works for certain LSTM models.)
Environment
Current Behavior:
Config file:
Image:
Command-line usage:
Expected Behavior:
i.e.
To elaborate: I would like to detect only a single hex number rendered in monospace in a PNG.
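Given that goal, a cheap sanity check on whatever the OCR returns is to test that it is a non-empty string of hex digits. This is a sketch under my own assumptions (lowercase digits only, a function name I made up), and it only validates the alphabet.

```shell
#!/bin/sh
# Succeed only for non-empty strings made purely of lowercase hex digits.
is_hex() {
    case $1 in
        '' | *[!0123456789abcdef]*) return 1 ;;
        *) return 0 ;;
    esac
}

is_hex 'eea6d7bbeS81' || echo "rejected: eea6d7bbeS81"
is_hex 'eea6d7bbe81' && echo "accepted: eea6d7bbe81"
```

Note that the string with the S stripped out still passes, so a check like this would need to be combined with a length check (if the expected length is known) to catch a dropped character.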