Open jbarth-ubhd opened 6 years ago
Please test using script/Latin which supports all languages written in Latin script. That may give better results than eng.
I am assuming that you used tesseract 4.0.0-beta. It is possible that legacy tesseract gives better results than LSTM based.
On Fri, Aug 3, 2018 at 2:58 PM jbarth-ubhd notifications@github.com wrote:
Dear Reader,
I've did some comparison with random text.
- Random text, to test the raw engine performance, not dictionaries
- because foreign, perhaps transcripted (foreign) names sometimes look like "Dhagax", "Hlabisa", "Pniv", ...
here is the original random text: original text https://digi.ub.uni-heidelberg.de/diglitData/v/orig.txt
here is the generated image (font: GaramondNo8): [image: Image of "Scan"] https://camo.githubusercontent.com/3aa4d17c2d9486c47bc4f9c6e19cf2893d9f7c9d/68747470733a2f2f646967692e75622e756e692d68656964656c626572672e64652f6469676c6974446174612f762f6f7269673030312e746966
Result: Filename Levenshtein distance abbyy11r8u3-English.txt 5 abbyy11r8u3-GermanLuxembourg.txt 2 orig.txt 0 tess-eng.txt 221 tess-engWithoutDict.txt 214
Abbyy language "GermanLuxembourg" has no "full dictionary", don't know, what this exactly means, but results are better than "English", because "itsan" would (using English) be recognized as "its an".
engWithoutDict has been made using combine_tessdata & rm *-dawg & combine_tessdata.
Kind regards, Jochen
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tessdata/issues/108, or mute the thread https://github.com/notifications/unsubscribe-auth/AE2_o2o4YIv-nergpBLmHEFcqnQAy-hWks5uNBfGgaJpZM4VtquT .
--
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
I've updated the table above. Thanks for the hint with script/Latin.traineddata.
Seems that tesseract3 has relatively more less "dictionary" in non-dict traineddata than tesseract4(LSTM).
Much better.
Here the wdiff -3 from tess4-scriptLatinWithoutDict-aq.txt (ą replaced by q):
characters replaced | count |
---|---|
v y | 1 |
i j | 3 |
e c | 4 |
í f | 2 |
I l | 1 |
i l | 1 |
c r | 1 |
o p | 1 |
(statistics not including [-chwin-] {+thwtn+} ... )
[-vrahi-] {+yrahj+}
[-kemiw-] {+kcmiw+}
[-UMWDYV tshhqq-] {+UMWDV tshhq+}
[-tfcemr-] {+tfcmr+}
[-byovsj-] {+byovs+}
{+oanpv+}
[-bfjw-] {+bfjwj+}
[-druar|-] {+druar+}
[-víddh-] {+vfddh+}
[-izabt-] {+jzabt+}
[-Inblf-] {+lnblf+}
[-dyzłj j j-] {+dyzfj+}
[-Wírbk cbked-] {+Wfrbk rbkcd+}
[-ordkp-] {+prdkp+}
[-hcecpn-] {+hccpn+}
[-urvle Wihsx-] {+urvlc Wlhsx+}
[-hmtfkj-] {+hmfkj+}
[-czhi-] {+pczhj+}
[-chwin-] {+thwtn+}
[-hcans-] {+hcqns+}
[-rzhje-] {+rzhjc+}
[-Irngj-] {+lrngj+}
[-o0xws-] {+ooxws+}
[-ubemsc-] {+ubemc+}
[-Ibknv-] {+lbknv+}
If there is an actual use case for this, I would suggest to finetune Latin traineddata with similarly generated random training text - using the Garamond font being tested and finetune for IMPACT - 300-400 iterations only.
Thanks for sharing!
Testing with random characters can make the lstm-based recognizer less accurate than real world text sample, due to the fact that during training the network learns not just letters shapes, but also builds a language model.
Results from ABBYY (GermanLuxembourg):
$ LANG=C dwdiff -3 -s test.gt.txt test.abbyy-GermanLuxembourg.txt
======================================================================
[-Ftvmn-] {+Ltvmn+}
======================================================================
old: 1018 words 1017 99% common 0 0% deleted 1 0% changed
new: 1018 words 1017 99% common 0 0% inserted 1 0% changed
Execution time was 6.6 s.
Result from Tesseract 4.0.0-beta.4 (tessdata/eng, --oem 0):
$ LANG=C dwdiff -3 -s test.gt.txt test.tess0.txt
======================================================================
[-ossac-] {+033210+}
======================================================================
[-Edqgd-] {+Edqu+}
======================================================================
[-olpso gxgko-] {+01pso ngko+}
======================================================================
[-Yfcpd fndtv-] {+chpd fndtV+}
[...]
[-xczxj wbjif axggb ilboa Nhxmg qkgvt-] {+XCZXj ijif 21ngb 11b021 Nthg quVt+}
======================================================================
[-Esxgd dgrjx jyelz-] {+ESng dgrjX jye1z+}
======================================================================
old: 1018 words 379 37% common 0 0% deleted 639 62% changed
new: 1045 words 379 36% common 0 0% inserted 666 63% changed
Execution time was 43.8 s.
With --psm 6
, the result becomes much better:
old: 1018 words 772 75% common 0 0% deleted 246 24% changed
new: 1032 words 772 74% common 0 0% inserted 260 25% changed
Using the lat
traineddata (which was trained with EB Garamond) further improves the recognition:
old: 1018 words 832 81% common 0 0% deleted 186 18% changed
new: 1034 words 832 80% common 0 0% inserted 202 19% changed
Result from Tesseract 4.0.0-beta.4 (tessdata/eng, --oem 1):
$ LANG=C dwdiff -3 -s test.gt.txt test.tess1.txt
======================================================================
[-fguof-] {+fguoft+}
======================================================================
[-byysf Yfcpd fndtv-] {+byyst Yicpd tndtv+}
======================================================================
[-pkqff-] {+pkqtf+}
[...]
[-xcwnc-] {+xcwne+}
======================================================================
[-Vfwfi-] {+Viwti+}
======================================================================
[-krfgr-] {+krigr+}
======================================================================
old: 1018 words 838 82% common 0 0% deleted 180 17% changed
new: 1021 words 838 82% common 0 0% inserted 183 17% changed
Execution time was 85.7 s.
Result from Tesseract 4.0.0-beta.4 (tessdata/script/Latin):
$ LANG=C dwdiff -3 -s test.gt.txt test.tess-latin.txt
[...]
old: 1018 words 975 95% common 0 0% deleted 43 4% changed
new: 1022 words 975 95% common 2 0% inserted 45 4% changed
Execution time was 121 s.
Surprisingly the result becomes better with --psm 6
, so the layout / line detection seems to have effects even for a very simple image like the present test image:
old: 1018 words 988 97% common 0 0% deleted 30 2% changed
new: 1018 words 988 97% common 0 0% inserted 30 2% changed
All test runs with the default page segmentation mode report diacritics (although there are none) which might be related to bad recognition rates:
Detected 143 diacritics
@stweil What about tessdata_fast and tessdata_best?
Apart from Shree question, does Abbyy uses more than one thread by default?
As tessdata
uses fast data derived from tessdata_best
, I don't expect much different results. Nevertheless I can run a test later.
ABBYY used a single thread. The Tesseract timings where also single threaded results. But my first focus is not execution time: quality of the OCR results is much more important for our application on old books and journals. We have other reports that Tesseract beats ABBYY when the text is already split in single lines. That would imply that Tesseract is less good than ABBYY for layout recognition (line separation), maybe also for binarization.
Result from Tesseract 4.0.0-beta.4 (tessdata_best/script/Latin, --psm 6):
old: 1018 words 996 97% common 0 0% deleted 22 2% changed
new: 1018 words 996 97% common 0 0% inserted 22 2% changed
Execution time was 318 s.
Result from Tesseract 4.0.0-beta.4 (tessdata_fast/script/Latin, --psm 6):
old: 1018 words 976 95% common 0 0% deleted 42 4% changed
new: 1018 words 976 95% common 0 0% inserted 42 4% changed
Execution time was 71 s.
tessdata_best | 97% | 318 s | |
tessdata | 97% | 121s | tessdata uses fast data derived from tessdata_best, |
tessdata_fast | 95% | 71s |
https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00#integration-with-tesseract
The Tesseract 4.00 neural network subsystem is integrated into Tesseract as a line recognizer. It can be used with the existing layout analysis to recognize text within a large document, or it can be used in conjunction with an external text detector to recognize text from an image of a single textline.
The neural network engine is the default for 4.00. To recognize text from an image of a single text line, use SetPageSegMode(PSM_RAW_LINE). This can be used from the command-line with -psm 13
@ stweil Would --psm 13 give better results?
Would --psm 13 give better results?
Maybe – what would you suggest for the line separation?
Oh, thanks for pointing this out, that would need to be done externally.
On Wed, Aug 15, 2018 at 1:58 AM Stefan Weil notifications@github.com wrote:
Would --psm 13 give better results?
Maybe – what would you suggest for the line separation?
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tessdata/issues/108#issuecomment-413004943, or mute the thread https://github.com/notifications/unsubscribe-auth/AE2_oyyqRoRAoDLDORyvt7GsCnHVm4tmks5uQzL-gaJpZM4VtquT .
--
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
@stweil In case, you are looking at improving line detection/page segmentation in tesseract, leptonica has some 'newer' functions which gave good results with test of Arabic and Devanagari.
https://github.com/DanBloomberg/leptonica/issues/236
On Wed, Aug 15, 2018 at 9:40 AM, Shree Devi Kumar shreeshrii@gmail.com wrote:
Oh, thanks for pointing this out, that would need to be done externally.
On Wed, Aug 15, 2018 at 1:58 AM Stefan Weil notifications@github.com wrote:
Would --psm 13 give better results?
Maybe – what would you suggest for the line separation?
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tessdata/issues/108#issuecomment-413004943, or mute the thread https://github.com/notifications/unsubscribe-auth/AE2_oyyqRoRAoDLDORyvt7GsCnHVm4tmks5uQzL-gaJpZM4VtquT .
--
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
--
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
Dear Reader,
I've did some comparison with random text.
here is the original random text: original text
here is the generated image (font: GaramondNo8):
Result:
Abbyy language "GermanLuxembourg" has no "full dictionary", don't know, what this exactly means, but results are better than "English", because "itsan" would (using English) be recognized as "its an".
engWithoutDict has been made using
Kind regards, Jochen