wrong result with very simple input

CanadianHusky commented 1 year ago

Basic Information

tesseract v5.3.0.20221214
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0

Operating System

Windows 10

Other Operating System

No response

uname -a

No response

Compiler

No response

Virtualization / Containers

No response

CPU

AMD Ryzen 9 3900

Current Behavior

C:\Program Files\Tesseract-OCR>tesseract "d:\temp\1b35f628-0719-4e44-8b65-6091dce488c9.png" stdout -l eng
$-110

The result is wrong despite a very clean input image. Why?

C:\Program Files\Tesseract-OCR>tesseract "d:\temp\1b35f628-0719-4e44-8b65-6091dce488c9.png" stdout -l deu
S-110

The result is correct only after changing the language model to German.

Expected Behavior

The eng.traineddata that comes with the standard installation should work correctly on such a simple input and produce S-110.

The sample I provided is just one basic example. I can produce more examples where very simple input is processed incorrectly by the LSTM engine but correctly by the legacy models.

Suggested Fix

The standard language file eng.traineddata needs to be checked to find out why it produces wrong results. The normal and best traineddata files for English fail on the provided sample as well. The additional language data files were taken from https://github.com/tesseract-ocr/tessdata_best and /tessdata.

The LSTM engine, although much faster, shows a regression on this sample, because the same file works fine with the English language and the non-LSTM (legacy) code.

Other Information

[Image: 1b35f628-0719-4e44-8b65-6091dce488c9.png, the sample PNG that produces $-110 when -l eng is used]

stweil commented 1 year ago

The model script/Latin also gives the correct result. That's not an issue of tesseract, but of the specific model.

tesseract https://user-images.githubusercontent.com/35332003/212057328-3166e017-d192-450a-8da8-4a0b53c81839.png - -l script/Latin
S-110

CanadianHusky commented 1 year ago

Thank you for the prompt response. Based on it, is it safe to assume that the model script/Latin can be used to replace eng.traineddata? Are all special characters like brackets, dash, dot, comma, question mark etc. included?

stweil commented 1 year ago

I hope so. See https://github.com/tesseract-ocr/langdata_lstm/blob/main/script/Latin/Latin.unicharset and compare it to https://github.com/tesseract-ocr/langdata_lstm/blob/main/eng/eng.unicharset.

But I would not expect that script/Latin always gives better results than eng.
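One way to make that comparison concrete is a small Python sketch that checks which wanted symbols are missing from a unicharset file. It assumes the usual unicharset layout (first line is the entry count, every following line starts with the symbol followed by a space and its properties). The helper names and the sample text are made up for illustration; download the two linked files and pass their contents instead.

```python
def unicharset_symbols(text):
    """Return the set of symbols listed in the text of a unicharset file.

    Assumption: line 1 holds the entry count and each later line starts
    with the symbol, followed by a space and its properties.
    """
    symbols = set()
    for line in text.splitlines()[1:]:
        if line.strip():
            symbols.add(line.split(" ", 1)[0])
    return symbols


def missing_symbols(wanted, text):
    """List the characters of `wanted` that the unicharset does not contain."""
    have = unicharset_symbols(text)
    return [ch for ch in wanted if ch not in have]


# Made-up miniature unicharset, only to show the call.
sample = "4\nNULL 0 Common 0\nS 5 Latin 1\n- 10 Common 2\n( 10 Common 3\n"
print(missing_symbols("S-(?", sample))
```

Running missing_symbols with the same wanted string against Latin.unicharset and eng.unicharset shows which special characters each model can output at all. It says nothing about how well a model recognizes them.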

CanadianHusky commented 1 year ago

But I would not expect that script/Latin always gives better results than eng.

I feared as much: fixing the reported issue by replacing the model will introduce unknown mistakes elsewhere, which is not an ideal solution. An updated, better eng.traineddata would be preferable. I think anyone working with English text would expect a clean letter S to be detected correctly, considering that eng.traineddata is a mandatory default that ships with the installer.

CanadianHusky commented 1 year ago

s-502

tesseract https://user-images.githubusercontent.com/35332003/212068999-2e0890b1-b024-4e93-a0bf-b1a424d8af6c.png stdout -l script/Latin
sS-502

tesseract https://user-images.githubusercontent.com/35332003/212068999-2e0890b1-b024-4e93-a0bf-b1a424d8af6c.png stdout -l eng
$-502

tesseract https://user-images.githubusercontent.com/35332003/212068999-2e0890b1-b024-4e93-a0bf-b1a424d8af6c.png stdout -l deu
S-502

really inconsistent and unexplainable results with script/Latin too

wollmers commented 1 year ago

tesseract https://user-images.githubusercontent.com/35332003/212068999-2e0890b1-b024-4e93-a0bf-b1a424d8af6c.png stdout -l script/Latin
sS-502

tesseract https://user-images.githubusercontent.com/35332003/212068999-2e0890b1-b024-4e93-a0bf-b1a424d8af6c.png stdout -l eng
$-502

tesseract https://user-images.githubusercontent.com/35332003/212068999-2e0890b1-b024-4e93-a0bf-b1a424d8af6c.png stdout -l deu
S-502

really inconsistent and unexplainable results with script/Latin too

In this case you could use voting in post-correction. Run all 3 models and compare.

First, vote on the length: ($-502, S-502) versus (sS-502).
Then put the letters of the differing part into a bag {S:2, s:1, $:1} and vote again.
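A minimal Python sketch of these two voting rounds, assuming the outputs are short single-line strings like the ones above. The function names run_models and vote are hypothetical, and taking the "differing part" to be whatever remains after stripping the common prefix and suffix is one possible reading of the idea, not a fixed algorithm.

```python
import os
import subprocess
from collections import Counter


def run_models(image, langs=("script/Latin", "eng", "deu")):
    """Run the tesseract CLI once per model and collect the outputs."""
    outputs = []
    for lang in langs:
        result = subprocess.run(
            ["tesseract", image, "stdout", "-l", lang],
            capture_output=True, text=True, check=True,
        )
        outputs.append(result.stdout.strip())
    return outputs


def vote(candidates):
    """Pick one candidate by voting on length, then on the differing letters."""
    # Round 1: the most common length wins.
    best_len = Counter(len(c) for c in candidates).most_common(1)[0][0]

    # Isolate the part in which the candidates differ.
    prefix = os.path.commonprefix(candidates)
    rests = [c[len(prefix):] for c in candidates]
    suffix = os.path.commonprefix([r[::-1] for r in rests])[::-1]
    middles = [r[:len(r) - len(suffix)] for r in rests]

    # Round 2: letters of all differing parts go into one bag.
    bag = Counter(ch for m in middles for ch in m)

    # Among the length winners, take the one whose letters got most votes.
    # Ties are resolved in favour of the earlier candidate.
    winners = [(c, m) for c, m in zip(candidates, middles) if len(c) == best_len]
    return max(winners, key=lambda cm: sum(bag[ch] for ch in cm[1]))[0]


print(vote(["sS-502", "$-502", "S-502"]))  # prints S-502
```

For real pages the outputs would first have to be aligned word by word before a vote like this can be applied to each word.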

Or you have $ versus S. This can be solved by frequency. The letters 'ernsti' are the most frequent in European languages. Thus 'S' overrules '$'. Building your own character frequency table is simple, if you have a special type of corpus. It's just counting characters into a hash (or dict in Python, bag, multiset, an indexed array).
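As a rough illustration in Python, the counting and the tie-break could look like the sketch below. The corpus string is invented for the example and the function names are hypothetical; a real table would be built from text of the same kind as the documents being recognized.

```python
from collections import Counter


def char_frequencies(corpus):
    """Count every non-whitespace character of the corpus."""
    return Counter(ch for ch in corpus if not ch.isspace())


def resolve(a, b, freq):
    """Return whichever of the two characters is more frequent in the table.

    If both are equally frequent, the first one is kept.
    """
    return a if freq[a] >= freq[b] else b


# Invented sample corpus, only for illustration.
corpus = "Section S-110 and Section S-502 cost $5"
freq = char_frequencies(corpus)
print(resolve("$", "S", freq))  # prints S
```

The result depends entirely on the corpus: with a table built from price lists, '$' could overrule 'S' instead.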

In historical OCR, voting improved my CER (Character Error Rate) from 1% to 0.9%. By also respecting frequencies (of words), I reached 0.36%.
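For reference, CER is commonly computed as the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The Python sketch below uses that common definition with a plain Levenshtein distance; it is an assumption that the figures above were measured exactly this way.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate of the hypothesis against a non-empty reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(cer("S-110", "$-110"))  # prints 0.2
```

On a five-character sample a single wrong character already gives a CER of 20%, so rates like 1% only make sense when measured over a large body of text.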

CanadianHusky commented 1 year ago

While I understand and appreciate the suggestion to improve the result with post-processing, I do not think it is feasible in many real-world cases.

The sample I provided above is a minimized excerpt from a very large amount of text of unknown origin, with variable length and no known pattern. The only thing known with certainty is that it consists of Latin letters.

Running all 3 models just for voting purposes means running 3 times slower. The results would be long text blocks, and complicated logic would be needed to determine which words/characters match across the different model runs before doing a frequency or length analysis.

Deciding which character overrules which based on frequency is also questionable. It only works if the data is assumed to be human-readable text, but some data contains non-natural words, character sequences and special signs combined into codes or tokens. $-502 may very well be the correct result for an image; overruling it with post-processing code defeats the purpose of OCR and of the business process.

I think it is reasonable to expect that the traineddata model files supplied with Tesseract recognize such clean input correctly with the English and Latin models.

wollmers commented 1 year ago

That's all true.

But OCR is still limited. Like all deep learning models, it suffers from the "never seen before" problem. Second, it's supervised machine learning and not adaptive.

And assuming Latin characters only is a bias. Even in the English documentation of Perl there are Japanese, Chinese and Greek characters. The same is true of German newspapers or Wikipedia.

At the moment it's not possible to train a "one size fits all" model. It would be very expensive.

But the standard models for (modern) English or German could be improved by e.g. including bullet-like symbols in the training data, as they are often used in modern texts.

Kaustubh-3105 commented 1 year ago

We tweaked the command to write the output to a separate text file instead of the command line, and we updated the tesseract library. After that, the OCR result was correct for the required English language.

The command used was:

tesseract "/file path/" "/output file path/" -l eng
