Open CanadianHusky opened 1 year ago
The model script/Latin also gives the correct result. That's not an issue of tesseract, but of the specific model.
tesseract https://user-images.githubusercontent.com/35332003/212057328-3166e017-d192-450a-8da8-4a0b53c81839.png - -l script/Latin
S-110
Thank you for the prompt response. Based on your response, is it safe to assume that the script/Latin model can be used to replace eng.traineddata? Are all special characters like brackets, dash, dot, comma, question mark, etc. included?
I hope so. See https://github.com/tesseract-ocr/langdata_lstm/blob/main/script/Latin/Latin.unicharset and compare it to https://github.com/tesseract-ocr/langdata_lstm/blob/main/eng/eng.unicharset.
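That comparison can also be done with a small script. This is a sketch under the assumption that the first line of a unicharset file is the entry count and each following line begins with the grapheme as its first space-separated field; the toy lines below are placeholders, not real unicharset entries:

```python
def read_charset(lines):
    # First line of a unicharset is the entry count; each following
    # line starts with the grapheme as its first space-separated field.
    return {line.split(" ")[0] for line in lines[1:] if line}

# Toy data standing in for eng.unicharset and Latin.unicharset:
eng = read_charset(["3", "S 0 ...", "$ 0 ...", "- 0 ..."])
latin = read_charset(["2", "S 0 ...", "- 0 ..."])
print(sorted(eng - latin))  # characters present in eng but missing from script/Latin
```

Feeding the two real files into `read_charset` would list any characters covered by eng but not by script/Latin.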
But I would not expect that script/Latin always gives better results than eng.
I feared that, because fixing the reported issue by replacing the model would introduce unknown mistakes elsewhere. Not an ideal solution. An update to a new/better eng.traineddata would be preferable, because anyone using the English language would expect a clean letter S to be detected correctly, given that eng is a default, mandatory component shipped with the installer.
tesseract https://user-images.githubusercontent.com/35332003/212068999-2e0890b1-b024-4e93-a0bf-b1a424d8af6c.png stdout -l script/Latin
sS-502
tesseract https://user-images.githubusercontent.com/35332003/212068999-2e0890b1-b024-4e93-a0bf-b1a424d8af6c.png stdout -l eng
$-502
tesseract https://user-images.githubusercontent.com/35332003/212068999-2e0890b1-b024-4e93-a0bf-b1a424d8af6c.png stdout -l deu
S-502
Really inconsistent and inexplicable results with script/Latin too.
In this case you could use voting in post-correction: run all 3 models and compare.
{S:2, s:1, $:1}
and vote again. Or you have $ versus S. This can be solved by frequency: the letters 'ernsti' are the most frequent in European languages, so 'S' overrules '$'. Building your own character frequency table is simple if you have a suitable corpus. It's just counting characters into a hash (a dict in Python, a bag, a multiset, or an indexed array).
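The majority-plus-frequency vote described above can be sketched in a few lines of Python. The frequency counts here are made-up placeholders; a real table would be counted from your own corpus:

```python
from collections import Counter

# Made-up corpus frequencies for illustration only; build a real table
# by counting characters from a corpus that matches your documents.
CHAR_FREQ = {"s": 60000, "S": 6000, "$": 120}

def vote(candidates):
    """Majority vote over per-model candidates; ties broken by corpus frequency."""
    counts = Counter(candidates)
    best = max(counts.values())
    tied = [ch for ch, n in counts.items() if n == best]
    return max(tied, key=lambda ch: CHAR_FREQ.get(ch, 0))

# {S:2, s:1, $:1} -> the majority wins
print(vote(["S", "s", "S", "$"]))  # S
# $ versus S -> corpus frequency decides
print(vote(["$", "S"]))            # S
```

The same idea extends to whole words by aligning the three model outputs first and voting position by position.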
In historical OCR I was able to improve the CER (Character Error Rate) from 1% to 0.9% by voting. By also respecting word frequencies, I reached 0.36%.
While I understand and appreciate the suggestion to improve the result with post-processing, I do not think it is feasible in many real-world cases.
The sample I provided above is a minimized version taken from an extreme amount of text, from an unknown source/origin, at variable length with no pattern or known length. The only thing that is known with certainty is that it consists of Latin letters.
Running all 3 models just for voting purposes means losing 3 times the speed. The result would be long text blocks, and complicated logic would be needed to determine which words/characters match across the different model runs in order to do a frequency or length analysis.
Which character overrules which based on frequency is also a questionable approach, and only works if the assumption is human-readable text, because some data contains non-natural words, sequences of characters and special signs combined as codes or tokens. $-502 may very well be a correct result for a given image; overruling that with post-processing code defeats the purpose of OCR and of the business process.
I think it is reasonable to expect that Tesseract's supplied traineddata model files recognize such clean input accurately with the English and Latin models.
That's all true.
But OCR is still limited. Like all deep learning models, it suffers from "never seen before". Second, it's supervised machine learning and not adaptive.
And assuming Latin characters only is a bias. Even in the English documentation of Perl there are Japanese, Chinese and Greek characters. The same goes for German newspapers or Wikipedia.
At the moment it's not possible to train a model as "one size fits all". It would be very expensive.
But the standard models for (modern) English or German could be improved by e.g. including bullet-like symbols in the training data, as they are often used in modern texts.
By tweaking the command to write the output to a separate text file instead of the command line, and by updating the tesseract library, correct OCR detection was achieved for the required English language.
The command used was -
tesseract "/file path/" "/output file path/" -l eng
Basic Information
tesseract v5.3.0.20221214
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0
Operating System
Windows 10
Other Operating System
No response
uname -a
No response
Compiler
No response
Virtualization / Containers
No response
CPU
AMD Ryzen 9 3900
Current Behavior
The result is wrong, despite a very clean input image. Why?
The result is correct only after changing the language model to German.
Expected Behavior
The eng.traineddata that comes with the standard installation should work correctly on such simple input and produce S-110.
The sample I provided is just one basic example. I can produce more examples where very simple input is processed incorrectly with the LSTM engine, but correctly with the legacy models.
Suggested Fix
The standard language file eng.traineddata needs to be checked for why it produces wrong results. The normal and best traineddata files for English fail on the provided sample as well. Additional language data files were taken from https://github.com/tesseract-ocr/tessdata_best and /tessdata.
The LSTM engine, although much faster, shows a regression in this sample, because the same file works fine with the English language and the non-LSTM (legacy) engine.
Other Information
sample png that produces $-110 when -l eng is used