tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

default PSM (--psm 3) accuracy issues #1327

Open Shreeshrii opened 6 years ago

Shreeshrii commented 6 years ago

Tesseract commit # https://github.com/tesseract-ocr/tesseract/commit/a50ff5277daad5d9831f82fc07cb39ba8f1ed589

-l eng

Using traineddata files from tessdata_fast

test image attached:

toc

Shreeshrii commented 6 years ago

File: ./toc.png
**** ./toc.png /tessdata_fast/ **
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 2 diacritics

** --oem 0 psm 1

Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

** --oem 0 psm 3

Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

** --oem 0 psm 4

Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

** --oem 0 psm 6

Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

** --oem 0 psm 11

Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

** --oem 0 psm 12

Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

** --oem 1 psm 1

Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
OSD: Weak margin (0.77) for 37 blob text block, but using orientation anyway: 0
Detected 2 diacritics

** --oem 1 psm 3

Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 2 diacritics

** --oem 1 psm 4

Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 2 diacritics

** --oem 1 psm 6

Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

** --oem 1 psm 11

Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 2 diacritics

** --oem 1 psm 12

Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
OSD: Weak margin (0.77) for 37 blob text block, but using orientation anyway: 0
Detected 2 diacritics

Shreeshrii commented 6 years ago

With --oem 0, the error is

Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

Instead, the message should say something like

Requested Legacy Tesseract model does not exist in traineddata.
Using LSTM engine and model instead.

and Tesseract should change the --oem value internally before processing.
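Until something like this exists in Tesseract itself, the proposed behavior can be approximated in a wrapper script. The following is only a sketch of that idea, not Tesseract's implementation: `ocr_with_fallback` and `run_cli` are hypothetical helper names, and detecting the missing legacy model by matching the `Failed loading language` text on stderr is an assumption based on the output shown in this thread.

```python
import subprocess
from typing import Callable, List, Tuple

# A runner takes a command line and returns (exit code, stdout, stderr).
Runner = Callable[[List[str]], Tuple[int, str, str]]

FALLBACK_NOTICE = (
    "Requested Legacy Tesseract model does not exist in traineddata.\n"
    "Using LSTM engine and model instead."
)


def run_cli(cmd: List[str]) -> Tuple[int, str, str]:
    """Run the tesseract command line and capture its output."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def ocr_with_fallback(image: str, tessdata_dir: str, oem: int = 0,
                      psm: int = 6, runner: Runner = run_cli) -> Tuple[str, str]:
    """Return (recognized text, notice); notice is empty if no fallback happened."""
    base = ["tesseract", image, "-", "--tessdata-dir", tessdata_dir,
            "-l", "eng", "--psm", str(psm)]
    code, out, err = runner(base + ["--oem", str(oem)])
    # Assumption: a failed --oem 0 run with this message means the
    # traineddata file has no legacy model, so retry with the LSTM engine.
    if code != 0 and oem == 0 and "Failed loading language" in err:
        code, out, err = runner(base + ["--oem", "1"])
        return out, FALLBACK_NOTICE
    return out, ""
```

Calling `ocr_with_fallback("toc.png", "tessdata_fast")` would then return the LSTM result together with the notice, instead of stopping with an initialization error.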

Shreeshrii commented 6 years ago

Also, the command-line default of --psm 3 does not give the best results with the LSTM engine on this image. Please consider whether the default should be changed to --psm 6.

default result

1 First chapter

wo 0 eo 00

cena


--psm 6

1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3

2 Last chapter 5
2.1 Section One 5
22 Section Two 5
23 Section Three 5

Attaching the OCRed text from the above image.

toc-eng-oem-1-psm-6.txt toc-eng-oem-1-psm-4.txt toc-eng-oem-1-psm-3.txt toc-eng-oem-1-psm-1.txt toc-eng-default.txt toc-eng-oem-1-psm-12.txt toc-eng-oem-1-psm-11.txt
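A comparison like the one attached above can be regenerated with a small loop over the engine and page segmentation modes. This is a sketch and not the script used for the attachments; `build_runs` and `run_all` are hypothetical helper names, and the label format only imitates the style of the logs in this thread. The command-line flags are the ones used elsewhere in this issue.

```python
import subprocess
from itertools import product
from typing import List, Sequence, Tuple


def build_runs(image: str, tessdata_dirs: Sequence[str], oems: Sequence[int],
               psms: Sequence[int], lang: str = "eng") -> List[Tuple[str, List[str]]]:
    """Build one (label, command) pair per tessdata/oem/psm combination."""
    runs = []
    for tessdata, oem, psm in product(tessdata_dirs, oems, psms):
        label = f"{image} LANG {lang} TESSDATA {tessdata} OEM {oem} PSM {psm}"
        cmd = ["tesseract", image, "-", "--tessdata-dir", tessdata,
               "-l", lang, "--oem", str(oem), "--psm", str(psm)]
        runs.append((label, cmd))
    return runs


def run_all(runs: List[Tuple[str, List[str]]]) -> None:
    """Execute every command and print its diagnostics followed by its text."""
    for label, cmd in runs:
        print(f"***** {label} ****")
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stderr + result.stdout)


# Example (requires tesseract on PATH and the tessdata directories):
# run_all(build_runs("./toc.png", ["tessdata_fast"], [0, 1], [1, 3, 4, 6, 11, 12]))
```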

Shreeshrii commented 5 years ago

Retested with latest code - the issues with --psm 3 still exist.

tesseract 4.1.0-rc1-250-g95a1
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0

 *****  ./English-TOC.png LANG eng TESSDATA tessdata OEM 0 PSM 3 ****
Detected 2 diacritics
1 mm chépter

wwww

macaw

 *****  ./English-TOC.png LANG eng TESSDATA tessdata OEM 0 PSM 6 ****
1 mm chapter 3
1.1 SectionOne 3
1.2 Section'l‘wn 3
1.3 Section'l'hme 3

2 Last chapter 5
2.1 SectionOne 5
2.2 Section'l‘wn 5
2.3 Section'l'hme 5

 *****  ./English-TOC.png LANG eng TESSDATA tessdata OEM 1 PSM 3 ****
Detected 2 diacritics
1 First chapter

wesw

aaa

 *****  ./English-TOC.png LANG eng TESSDATA tessdata OEM 1 PSM 6 ****
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3

2 Last chapter 5
2.1 Section One 5
22 Section Two 5
23 Section Three 5

 *****  ./English-TOC.png LANG eng TESSDATA tessdata_best OEM 0 PSM 3 ****
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

 *****  ./English-TOC.png LANG eng TESSDATA tessdata_best OEM 0 PSM 6 ****
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

 *****  ./English-TOC.png LANG eng TESSDATA tessdata_best OEM 1 PSM 3 ****
Detected 2 diacritics
1 First chapter

wows

aaa

 *****  ./English-TOC.png LANG eng TESSDATA tessdata_best OEM 1 PSM 6 ****
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3

2 Last chapter 5
2.1 Section One 5
22 Section Two 5
23 Section Three 5

 *****  ./English-TOC.png LANG eng TESSDATA tessdata_fast OEM 0 PSM 3 ****
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

 *****  ./English-TOC.png LANG eng TESSDATA tessdata_fast OEM 0 PSM 6 ****
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

 *****  ./English-TOC.png LANG eng TESSDATA tessdata_fast OEM 1 PSM 3 ****
Detected 2 diacritics
1 First chapter

wo 0 eo 00

cena

 *****  ./English-TOC.png LANG eng TESSDATA tessdata_fast OEM 1 PSM 6 ****
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3

2 Last chapter 5
2.1 Section One 5
22 Section Two 5
23 Section Three 5

DONE

Shreeshrii commented 5 years ago

Also note that with --psm 6, --oem 0 gets the section numbers 2.2 and 2.3 correct but the text on those lines wrong, while --oem 1 reads the section numbers as 22 and 23 but gets the text correct.
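The difference between the two engines is easier to see field by field. The sketch below splits each table-of-contents line into section number, title, and page, then reports which fields agree between the two outputs. It uses the second-chapter lines from the --psm 6 results earlier in this thread; `split_toc_line` and `compare_outputs` are hypothetical helpers, and the assumption that the first and last whitespace-separated tokens are the number and the page only holds for simple layouts like this one.

```python
from typing import Dict, List, Tuple

# Second-chapter lines copied from the --psm 6 runs with the tessdata files.
LEGACY = ["2 Last chapter 5", "2.1 SectionOne 5",
          "2.2 Section'l‘wn 5", "2.3 Section'l'hme 5"]
LSTM = ["2 Last chapter 5", "2.1 Section One 5",
        "22 Section Two 5", "23 Section Three 5"]


def split_toc_line(line: str) -> Tuple[str, str, str]:
    """Split 'number title ... page' into (number, title, page)."""
    parts = line.split()
    if len(parts) < 3:
        raise ValueError(f"not a TOC line: {line!r}")
    return parts[0], " ".join(parts[1:-1]), parts[-1]


def compare_outputs(a_lines: List[str], b_lines: List[str]) -> List[Dict[str, bool]]:
    """For each pair of lines, record which fields are identical."""
    rows = []
    for a, b in zip(a_lines, b_lines):
        num_a, title_a, page_a = split_toc_line(a)
        num_b, title_b, page_b = split_toc_line(b)
        rows.append({
            "number_match": num_a == num_b,
            "title_match": title_a == title_b,
            "page_match": page_a == page_b,
        })
    return rows


if __name__ == "__main__":
    for legacy, row in zip(LEGACY, compare_outputs(LEGACY, LSTM)):
        print(legacy, row)
```

The page numbers agree on every line; the engines disagree on the section number only for the last two entries, and on the title for every section entry.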

Shreeshrii commented 5 years ago

Also see https://github.com/tesseract-ocr/tesseract/issues/2381 Issue with chi_tra tessdata_fast and --psm 3

zdenop commented 5 years ago

@Shreeshrii: 0150fc57ccbdbf64381ad534f969a63c9942e3b7 should report when the Tesseract (legacy) engine is not present. But I am not sure about automatically switching to LSTM (or vice versa). When a user specifies --oem, there should be a reason for it, so exiting seems more reasonable to me than continuing with a model that was not requested.

Shreeshrii commented 5 years ago

@zdenop

> report if tesseract engine (legacy) is not present.

Yes that is happening now.

ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 6 --tessdata-dir ~/tessdata_best --oem 0
Error: Tesseract (legacy) engine requested, but components are not present in /home/ubuntu/tessdata_best/eng.traineddata!!
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 6 --tessdata-dir ~/tessdata_fast --oem 0
Error: Tesseract (legacy) engine requested, but components are not present in /home/ubuntu/tessdata_fast/eng.traineddata!!
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
ubuntu@tesseract-ocr:~/TEST$

Thank you!

Shreeshrii commented 5 years ago

The main issue reported here, though, is the recognition of this image with the default --psm 3. That problem still exists.

TOC

ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 6 --tessdata-dir ~/tessdata_best
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3

2 Last chapter 5
2.1 Section One 5
22 Section Two 5
23 Section Three 5
ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 6 --tessdata-dir ~/tessdata_fast
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3

2 Last chapter 5
2.1 Section One 5
22 Section Two 5
23 Section Three 5
ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 6 --tessdata-dir ~/tessdata
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3

2 Last chapter 5
2.1 Section One 5
22 Section Two 5
23 Section Three 5

--psm 3 is giving incorrect output.

ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 3 --tessdata-dir ~/tessdata
Detected 2 diacritics
1 First chapter

e w e

Qe
ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 3 --tessdata-dir ~/tessdata_fast
Detected 2 diacritics
1 First chapter

wo 0 eo 00

cena
ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 3 --tessdata-dir ~/tessdata_best
Detected 2 diacritics
1 First chapter

wows

aaa
ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 3 --tessdata-dir ~/tessdata --oem 0
Detected 2 diacritics
1 mm chépter

wwww

macaw

Shreeshrii commented 5 years ago

Another psm related test case is at https://github.com/tesseract-ocr/tesseract/issues/2639#issuecomment-544093548

Shreeshrii commented 3 years ago

Problem still exists in recognizing this image with --psm 3 (default).

(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 1327.png  - --tessdata-dir ../tessdata -l eng
Detected 2 diacritics
1 First chapter

e w e

EREr)
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 1327.png  - --tessdata-dir ../tessdata_best -l eng
Detected 2 diacritics
1 First chapter

wow

ERE)
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 1327.png  - --tessdata-dir ../tessdata_fast -l eng
Detected 2 diacritics
1 First chapter

wo 0 co 00

cena
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 1327.png  - --tessdata-dir ../tessdata_fast -l eng --psm 6
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3

2 Last chapter 5
21 Section One 5
22 Section Two 5
23° Section Three 5
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 1327.png  - --tessdata-dir ../tessdata_best -l eng --psm 6
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3

2 Last chapter 5
2.1 Section One 5
22 Section Two 5
23 Section Three 5
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 1327.png  - --tessdata-dir ../tessdata -l eng --psm 6
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3