tesseract-ocr / tessdata

Trained models with fast variant of the "best" LSTM models + legacy models
Apache License 2.0

Synthetic comparison with Abbyy #108

Open jbarth-ubhd opened 6 years ago

jbarth-ubhd commented 6 years ago

Dear Reader,

I've done some comparisons with random text:

  • random text, to test the raw engine performance rather than the dictionaries,
  • because foreign, perhaps transcribed (foreign) names sometimes look like "Dhagax", "Hlabisa", "Pniv", ...

Here is the original random text: https://digi.ub.uni-heidelberg.de/diglitData/v/orig.txt

Here is the generated image (font: GaramondNo8): Image of "Scan"

Result:

Version                  Filename                                                      Levenshtein distance
-                        abbyy11-English.txt                                              5
-                        abbyy11-GermanLuxembourg.txt                                     2
-                        orig.txt                                                         0
v3.04.01                 tess3-eng.txt                                                 1273
v3.04.01                 tess3-engWithoutDict.txt                                       763
v4.0.0-beta.2-556-g607e  tess4-eng.txt                                                  222
v4.0.0-beta.2-556-g607e  tess4-engWithoutDict.txt                                       215
v4.0.0-beta.2-556-g607e  tess4-scriptLatin.txt                                           62
v4.1.0                   tess4-scriptLatin.txt                                           62
v4.0.0-beta.2-556-g607e  tess4-scriptLatinWithoutDict.txt                                58
v4.0.0-beta.2-556-g607e  tess4-scriptLatinWithoutDict.txt (ą replaced by q manually)     45

The Abbyy language "GermanLuxembourg" has no "full dictionary"; I don't know what exactly this means, but its results are better than those for "English", because "itsan" would (using English) be recognized as "its an".

engWithoutDict has been made using

combine_tessdata -u ...
rm *-dawg 
combine_tessdata ...
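
Spelled out, that is presumably something like the following (the eng prefix and filenames are assumptions, not the exact commands used):

combine_tessdata -u eng.traineddata engWithoutDict.   # unpack all components of the traineddata
rm engWithoutDict.*-dawg                              # drop the legacy and LSTM dictionary (DAWG) files
combine_tessdata engWithoutDict.                      # repack the rest into engWithoutDict.traineddata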

Kind regards, Jochen

Shreeshrii commented 6 years ago

Please test using script/Latin which supports all languages written in Latin script. That may give better results than eng.
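
For example, assuming Latin.traineddata is installed under the tessdata/script directory and the test image is named scan.tif (a placeholder name), something like:

tesseract scan.tif scan-latin -l script/Latin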

I am assuming that you used tesseract 4.0.0-beta. It is possible that legacy tesseract gives better results than the LSTM-based engine.


jbarth-ubhd commented 6 years ago

I've updated the table above. Thanks for the hint with script/Latin.traineddata.

It seems that tesseract 3 keeps relatively much less "dictionary" behaviour in the non-dict traineddata than tesseract 4 (LSTM) does.

Much better.

jbarth-ubhd commented 6 years ago

Here is the wdiff -3 for tess4-scriptLatinWithoutDict-aq.txt (ą replaced by q).
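The diff was presumably produced with a command along these lines (the exact filenames are an assumption):

wdiff -3 orig.txt tess4-scriptLatinWithoutDict-aq.txt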

characters replaced (original → recognized)   count
v → y                                             1
i → j                                             3
e → c                                             4
í → f                                             2
I → l                                             1
i → l                                             1
c → r                                             1
o → p                                             1

(statistics not including [-chwin-] {+thwtn+} ... )

 [-vrahi-] {+yrahj+}
 [-kemiw-] {+kcmiw+}
 [-UMWDYV tshhqq-] {+UMWDV tshhq+}
 [-tfcemr-] {+tfcmr+}
 [-byovsj-] {+byovs+}
 {+oanpv+}
 [-bfjw-] {+bfjwj+}
 [-druar|-] {+druar+}
[-víddh-] {+vfddh+}
 [-izabt-] {+jzabt+}
 [-Inblf-] {+lnblf+}
[-dyzłj j j-] {+dyzfj+}
 [-Wírbk cbked-] {+Wfrbk rbkcd+}
 [-ordkp-] {+prdkp+}
 [-hcecpn-] {+hccpn+}
 [-urvle Wihsx-] {+urvlc Wlhsx+}
 [-hmtfkj-] {+hmfkj+}
 [-czhi-] {+pczhj+}
 [-chwin-] {+thwtn+}
 [-hcans-] {+hcqns+}
 [-rzhje-] {+rzhjc+}
 [-Irngj-] {+lrngj+}
 [-o0xws-] {+ooxws+}
 [-ubemsc-] {+ubemc+}
[-Ibknv-] {+lbknv+}

Shreeshrii commented 6 years ago

If there is an actual use case for this, I would suggest fine-tuning the Latin traineddata with similarly generated random training text, using the Garamond font being tested, following the "Fine Tuning for Impact" approach with only 300-400 iterations.
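
A rough sketch of that recipe, loosely following the "Fine Tuning for Impact" steps from the TrainingTesseract-4.00 wiki; all paths, the lat langdata choice, the font name and the iteration count are assumptions, not tested values:

# 1. generate LSTM line training data from the random text with the Garamond font
src/training/tesstrain.sh --lang lat --linedata_only \
  --fontlist "GaramondNo8" --fonts_dir /usr/share/fonts \
  --training_text orig.txt \
  --langdata_dir ../langdata --tessdata_dir ./tessdata \
  --output_dir ~/finetune_garamond

# 2. extract the LSTM model from the existing best Latin traineddata
combine_tessdata -e tessdata_best/script/Latin.traineddata Latin.lstm

# 3. fine-tune for a few hundred iterations only
lstmtraining --continue_from Latin.lstm \
  --traineddata tessdata_best/script/Latin.traineddata \
  --train_listfile ~/finetune_garamond/lat.training_files.txt \
  --model_output ~/finetune_garamond/garamond \
  --max_iterations 400

# 4. convert the final checkpoint back into a traineddata file
lstmtraining --stop_training \
  --continue_from ~/finetune_garamond/garamond_checkpoint \
  --traineddata tessdata_best/script/Latin.traineddata \
  --model_output Latin_garamond.traineddata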

amitdo commented 6 years ago

Thanks for sharing!

Testing with random characters can make the LSTM-based recognizer less accurate than a real-world text sample would, because during training the network learns not just letter shapes but also builds an implicit language model.

stweil commented 6 years ago

Results from ABBYY (GermanLuxembourg):

$ LANG=C dwdiff -3 -s test.gt.txt test.abbyy-GermanLuxembourg.txt
======================================================================
 [-Ftvmn-] {+Ltvmn+}
======================================================================
old: 1018 words  1017 99% common  0 0% deleted  1 0% changed
new: 1018 words  1017 99% common  0 0% inserted  1 0% changed

Execution time was 6.6 s.

stweil commented 6 years ago

Result from Tesseract 4.0.0-beta.4 (tessdata/eng, --oem 0):

$ LANG=C dwdiff -3 -s test.gt.txt test.tess0.txt 
======================================================================
 [-ossac-] {+033210+}
======================================================================
 [-Edqgd-] {+Edqu+}
======================================================================
 [-olpso gxgko-] {+01pso ngko+}
======================================================================
 [-Yfcpd fndtv-] {+chpd fndtV+}
[...]
 [-xczxj wbjif axggb ilboa Nhxmg qkgvt-] {+XCZXj ijif 21ngb 11b021 Nthg quVt+}
======================================================================
 [-Esxgd dgrjx jyelz-] {+ESng dgrjX jye1z+}
======================================================================
old: 1018 words  379 37% common  0 0% deleted  639 62% changed
new: 1045 words  379 36% common  0 0% inserted  666 63% changed

Execution time was 43.8 s.

With --psm 6, the result becomes much better:

old: 1018 words  772 75% common  0 0% deleted  246 24% changed
new: 1032 words  772 74% common  0 0% inserted  260 25% changed

Using the lat traineddata (which was trained with EB Garamond) further improves the recognition:

old: 1018 words  832 81% common  0 0% deleted  186 18% changed
new: 1034 words  832 80% common  0 0% inserted  202 19% changed
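
(For reference, these runs correspond to invocations roughly like the following; the image and output base names are placeholders, and the --oem 0 / --psm 6 settings mirror the description above:)

$ tesseract scan.tif out-eng --oem 0 --psm 6 -l eng
$ tesseract scan.tif out-lat --oem 0 --psm 6 -l lat
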
stweil commented 6 years ago

Result from Tesseract 4.0.0-beta.4 (tessdata/eng, --oem 1):

$ LANG=C dwdiff -3 -s test.gt.txt test.tess1.txt 
======================================================================
 [-fguof-] {+fguoft+}
======================================================================
 [-byysf Yfcpd fndtv-] {+byyst Yicpd tndtv+}
======================================================================
 [-pkqff-] {+pkqtf+}
[...]
 [-xcwnc-] {+xcwne+}
======================================================================
 [-Vfwfi-] {+Viwti+}
======================================================================
 [-krfgr-] {+krigr+}
======================================================================
old: 1018 words  838 82% common  0 0% deleted  180 17% changed
new: 1021 words  838 82% common  0 0% inserted  183 17% changed

Execution time was 85.7 s.

stweil commented 6 years ago

Result from Tesseract 4.0.0-beta.4 (tessdata/script/Latin):

$ LANG=C dwdiff -3 -s test.gt.txt test.tess-latin.txt 
[...]
old: 1018 words  975 95% common  0 0% deleted  43 4% changed
new: 1022 words  975 95% common  2 0% inserted  45 4% changed

Execution time was 121 s.

Surprisingly, the result becomes better with --psm 6, so the layout / line detection seems to have an effect even for a very simple image like the present test image:

old: 1018 words  988 97% common  0 0% deleted  30 2% changed
new: 1018 words  988 97% common  0 0% inserted  30 2% changed

All test runs with the default page segmentation mode report diacritics (although there are none), which might be related to the bad recognition rates:

Detected 143 diacritics

Shreeshrii commented 6 years ago

@stweil What about tessdata_fast and tessdata_best?

amitdo commented 6 years ago

Apart from Shree's question, does Abbyy use more than one thread by default?

stweil commented 6 years ago

As tessdata uses fast data derived from tessdata_best, I don't expect very different results. Nevertheless, I can run a test later.

ABBYY used a single thread. The Tesseract timings were also single-threaded results. But my first focus is not execution time: the quality of the OCR results is much more important for our application on old books and journals. We have other reports that Tesseract beats ABBYY when the text is already split into single lines. That would imply that Tesseract is worse than ABBYY at layout recognition (line separation), and maybe also at binarization.

stweil commented 6 years ago

Result from Tesseract 4.0.0-beta.4 (tessdata_best/script/Latin, --psm 6):

old: 1018 words  996 97% common  0 0% deleted  22 2% changed
new: 1018 words  996 97% common  0 0% inserted  22 2% changed

Execution time was 318 s.

Result from Tesseract 4.0.0-beta.4 (tessdata_fast/script/Latin, --psm 6):

old: 1018 words  976 95% common  0 0% deleted  42 4% changed
new: 1018 words  976 95% common  0 0% inserted  42 4% changed

Execution time was 71 s.
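
(For reference, switching between the two model repositories is presumably just a matter of pointing --tessdata-dir at the respective clone; the paths and image name here are assumptions:)

$ tesseract scan.tif out-best --tessdata-dir ./tessdata_best --psm 6 -l script/Latin
$ tesseract scan.tif out-fast --tessdata-dir ./tessdata_fast --psm 6 -l script/Latin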

Shreeshrii commented 6 years ago

traineddata     common words   execution time
tessdata_best   97%            318 s
tessdata        97%            121 s
tessdata_fast   95%             71 s

(tessdata uses fast data derived from tessdata_best.)

Shreeshrii commented 6 years ago

https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00#integration-with-tesseract

The Tesseract 4.00 neural network subsystem is integrated into Tesseract as a line recognizer. It can be used with the existing layout analysis to recognize text within a large document, or it can be used in conjunction with an external text detector to recognize text from an image of a single textline.

The neural network engine is the default for 4.00. To recognize text from an image of a single text line, use SetPageSegMode(PSM_RAW_LINE). This can be used from the command-line with -psm 13
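
A hedged example of such a single-line invocation, assuming the line image has already been cropped externally (the filename is a placeholder):

tesseract line001.tif line001 --psm 13 -l script/Latin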

@stweil Would --psm 13 give better results?

stweil commented 6 years ago

Would --psm 13 give better results?

Maybe – what would you suggest for the line separation?

Shreeshrii commented 6 years ago

Oh, thanks for pointing this out, that would need to be done externally.


Shreeshrii commented 6 years ago

@stweil In case you are looking at improving line detection/page segmentation in tesseract, leptonica has some 'newer' functions which gave good results in tests with Arabic and Devanagari.

https://github.com/DanBloomberg/leptonica/issues/236
