Shreeshrii opened this issue 6 years ago (status: open)
File: ./toc.png **** ./toc.png /tessdata_fast/ ** Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica Detected 2 diacritics
** --oem 0 psm 1
Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.
** --oem 0 psm 3
Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.
** --oem 0 psm 4
Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.
** --oem 0 psm 6
Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.
** --oem 0 psm 11
Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.
** --oem 0 psm 12
Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.
** --oem 1 psm 1
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica OSD: Weak margin (0.77) for 37 blob text block, but using orientation anyway: 0 Detected 2 diacritics
** --oem 1 psm 3
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica Detected 2 diacritics
** --oem 1 psm 4
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica Detected 2 diacritics
** --oem 1 psm 6
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
** --oem 1 psm 11
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica Detected 2 diacritics
** --oem 1 psm 12
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica OSD: Weak margin (0.77) for 37 blob text block, but using orientation anyway: 0 Detected 2 diacritics
With --oem 0, the error is
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
Instead, the message should say something like
Requested Legacy Tesseract model does not exist in traineddata.
Using LSTM engine and model instead.
and Tesseract should then change the --oem value internally before processing.
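The fallback suggested above could look roughly like the sketch below. It is only an illustration of the proposed policy, not Tesseract's actual code: `resolve_oem`, `EngineChoice`, and the `has_legacy` / `has_lstm` flags are hypothetical names, and how Tesseract really inspects a traineddata file is not modelled here.

```python
from typing import NamedTuple, Optional

OEM_LEGACY = 0  # --oem 0: legacy Tesseract engine
OEM_LSTM = 1    # --oem 1: LSTM engine


class EngineChoice(NamedTuple):
    oem: int                 # engine mode that would actually be used
    warning: Optional[str]   # message for the user, or None if nothing changed


def resolve_oem(requested_oem: int, has_legacy: bool, has_lstm: bool) -> EngineChoice:
    """Hypothetical fallback policy; not Tesseract's actual implementation."""
    if requested_oem == OEM_LEGACY and not has_legacy:
        if not has_lstm:
            raise ValueError("traineddata contains neither a legacy nor an LSTM model")
        # Legacy was requested but is missing: warn and switch to LSTM.
        return EngineChoice(
            OEM_LSTM,
            "Requested Legacy Tesseract model does not exist in traineddata.\n"
            "Using LSTM engine and model instead.",
        )
    # Otherwise keep whatever the user asked for.
    return EngineChoice(requested_oem, None)


# In the logs above, --oem 0 fails with tessdata_fast, i.e. no legacy model is present.
choice = resolve_oem(OEM_LEGACY, has_legacy=False, has_lstm=True)
if choice.warning:
    print(choice.warning)
```

The warning text is taken verbatim from the proposal; everything else is an assumption made for the sake of the example.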
Also, the command-line default of --psm 3 does not give the best results with the LSTM engine on this image. Please consider whether the default should be changed to --psm 6.
default result (--psm 3)
1 First chapter
wo 0 eo 00
cena
--psm 6
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3
2 Last chapter 5
2.1 Section One 5
22 Section Two 5
23 Section Three 5
Attaching the OCRed text from the above image.
toc-eng-oem-1-psm-6.txt toc-eng-oem-1-psm-4.txt toc-eng-oem-1-psm-3.txt toc-eng-oem-1-psm-1.txt toc-eng-default.txt toc-eng-oem-1-psm-12.txt toc-eng-oem-1-psm-11.txt
Retested with latest code - the issues with --psm 3 still exist.
tesseract 4.1.0-rc1-250-g95a1
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
***** ./English-TOC.png LANG eng TESSDATA tessdata OEM 0 PSM 3 ****
Detected 2 diacritics
1 mm chépter
wwww
macaw
***** ./English-TOC.png LANG eng TESSDATA tessdata OEM 0 PSM 6 ****
1 mm chapter 3
1.1 SectionOne 3
1.2 Section'l‘wn 3
1.3 Section'l'hme 3
2 Last chapter 5
2.1 SectionOne 5
2.2 Section'l‘wn 5
2.3 Section'l'hme 5
***** ./English-TOC.png LANG eng TESSDATA tessdata OEM 1 PSM 3 ****
Detected 2 diacritics
1 First chapter
wesw
aaa
***** ./English-TOC.png LANG eng TESSDATA tessdata OEM 1 PSM 6 ****
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3
2 Last chapter 5
2.1 Section One 5
22 Section Two 5
23 Section Three 5
***** ./English-TOC.png LANG eng TESSDATA tessdata_best OEM 0 PSM 3 ****
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
***** ./English-TOC.png LANG eng TESSDATA tessdata_best OEM 0 PSM 6 ****
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
***** ./English-TOC.png LANG eng TESSDATA tessdata_best OEM 1 PSM 3 ****
Detected 2 diacritics
1 First chapter
wows
aaa
***** ./English-TOC.png LANG eng TESSDATA tessdata_best OEM 1 PSM 6 ****
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3
2 Last chapter 5
2.1 Section One 5
22 Section Two 5
23 Section Three 5
***** ./English-TOC.png LANG eng TESSDATA tessdata_fast OEM 0 PSM 3 ****
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
***** ./English-TOC.png LANG eng TESSDATA tessdata_fast OEM 0 PSM 6 ****
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
***** ./English-TOC.png LANG eng TESSDATA tessdata_fast OEM 1 PSM 3 ****
Detected 2 diacritics
1 First chapter
wo 0 eo 00
cena
***** ./English-TOC.png LANG eng TESSDATA tessdata_fast OEM 1 PSM 6 ****
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3
2 Last chapter 5
2.1 Section One 5
22 Section Two 5
23 Section Three 5
DONE
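For anyone who wants to repeat this test matrix, here is a minimal sketch that builds one `tesseract` command per tessdata directory, OEM, and PSM combination, in the same order as the log above. The helper names `build_commands` and `run_all` are hypothetical, and the relative tessdata directory names are assumed to match the local checkout; only standard `tesseract` command-line options (`--tessdata-dir`, `-l`, `--oem`, `--psm`) are used.

```python
import itertools
import subprocess
from typing import List


def build_commands(image: str, lang: str, tessdata_dirs: List[str],
                   oems: List[int], psms: List[int]) -> List[List[str]]:
    """Build one tesseract argv per (tessdata dir, oem, psm) combination."""
    commands = []
    for tessdata, oem, psm in itertools.product(tessdata_dirs, oems, psms):
        commands.append([
            "tesseract", image, "-",      # "-" sends the OCR text to stdout
            "--tessdata-dir", tessdata,
            "-l", lang,
            "--oem", str(oem),
            "--psm", str(psm),
        ])
    return commands


def run_all(commands: List[List[str]], runner=subprocess.run) -> None:
    """Run each command and print its output under a header line."""
    for cmd in commands:
        print("*****", " ".join(cmd), "****")
        result = runner(cmd, capture_output=True, text=True)
        print(result.stdout + result.stderr)


commands = build_commands("./English-TOC.png", "eng",
                          ["tessdata", "tessdata_best", "tessdata_fast"],
                          oems=[0, 1], psms=[3, 6])
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run_all(commands)` would execute the twelve runs; it is left out of the sketch so the block can be read and checked without Tesseract installed.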
Also note that while --oem 0 gets the section numbers 2.2 and 2.3 correct, the text on those lines is incorrect. --oem 1 reads the section numbers as 22 and 23, but the text is correct.
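One way to make this kind of comparison less manual is to check each run against the expected text line by line. The sketch below assumes the ground truth for the image is the eight-line table of contents pieced together from the runs above (the image itself is not reproduced here), and `mismatched_lines` is a hypothetical helper name.

```python
from typing import List

# Assumed ground truth for the TOC image, pieced together from the runs above.
EXPECTED = [
    "1 First chapter 3",
    "1.1 Section One 3",
    "1.2 Section Two 3",
    "1.3 Section Three 3",
    "2 Last chapter 5",
    "2.1 Section One 5",
    "2.2 Section Two 5",
    "2.3 Section Three 5",
]


def mismatched_lines(ocr_text: str, expected: List[str] = EXPECTED) -> List[int]:
    """Return 1-based numbers of expected lines the OCR output got wrong or missed."""
    got = [line.strip() for line in ocr_text.splitlines() if line.strip()]
    bad = []
    for i, want in enumerate(expected):
        if i >= len(got) or got[i] != want:
            bad.append(i + 1)
    return bad


# Output of the --oem 1 --psm 6 run, copied from the log above.
lstm_psm6 = """1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3
2 Last chapter 5
2.1 Section One 5
22 Section Two 5
23 Section Three 5"""

print(mismatched_lines(lstm_psm6))  # [7, 8]
```

An exact string comparison is deliberately strict; a similarity score would be more forgiving, but it would hide the dropped decimal points that are the point of the observation.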
Also see https://github.com/tesseract-ocr/tesseract/issues/2381 (issue with chi_tra tessdata_fast and --psm 3).
@Shreeshrii: Commit 0150fc57ccbdbf64381ad534f969a63c9942e3b7 should report if the Tesseract (legacy) engine is not present. But I am not sure about automatically falling back to LSTM (or vice versa). When a user specifies an oem value, there should be a reason for it, so exiting seems more reasonable to me than continuing with a model that was not requested.
@zdenop >report if tesseract engine (legacy) is not present.
Yes that is happening now.
ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 6 --tessdata-dir ~/tessdata_best --oem 0
Error: Tesseract (legacy) engine requested, but components are not present in /home/ubuntu/tessdata_best/eng.traineddata!!
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 6 --tessdata-dir ~/tessdata_fast --oem 0
Error: Tesseract (legacy) engine requested, but components are not present in /home/ubuntu/tessdata_fast/eng.traineddata!!
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
ubuntu@tesseract-ocr:~/TEST$
Thank you!
The main issue reported here, though, is the recognition of this image with the default psm of 3. That problem still exists.
ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 6 --tessdata-dir ~/tessdata_best
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3
2 Last chapter 5
2.1 Section One 5
22 Section Two 5
23 Section Three 5
ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 6 --tessdata-dir ~/tessdata_fast
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3
2 Last chapter 5
2.1 Section One 5
22 Section Two 5
23 Section Three 5
ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 6 --tessdata-dir ~/tessdata
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3
2 Last chapter 5
2.1 Section One 5
22 Section Two 5
23 Section Three 5
--psm 3 is giving incorrect output.
ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 3 --tessdata-dir ~/tessdata
Detected 2 diacritics
1 First chapter
e w e
Qe
ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 3 --tessdata-dir ~/tessdata_fast
Detected 2 diacritics
1 First chapter
wo 0 eo 00
cena
ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 3 --tessdata-dir ~/tessdata_best
Detected 2 diacritics
1 First chapter
wows
aaa
ubuntu@tesseract-ocr:~/TEST$ tesseract TOC.png - --psm 3 --tessdata-dir ~/tessdata --oem 0
Detected 2 diacritics
1 mm chépter
wwww
macaw
Another psm related test case is at https://github.com/tesseract-ocr/tesseract/issues/2639#issuecomment-544093548
The problem with recognizing this image with --psm 3 (the default) still exists.
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 1327.png - --tessdata-dir ../tessdata -l eng
Detected 2 diacritics
1 First chapter
e w e
EREr)
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 1327.png - --tessdata-dir ../tessdata_best -l eng
Detected 2 diacritics
1 First chapter
wow
ERE)
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 1327.png - --tessdata-dir ../tessdata_fast -l eng
Detected 2 diacritics
1 First chapter
wo 0 co 00
cena
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 1327.png - --tessdata-dir ../tessdata_fast -l eng --psm 6
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3
2 Last chapter 5
21 Section One 5
22 Section Two 5
23° Section Three 5
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 1327.png - --tessdata-dir ../tessdata_best -l eng --psm 6
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3
2 Last chapter 5
2.1 Section One 5
22 Section Two 5
23 Section Three 5
(base) ubuntu@tesseract-ocr-1:~/TEST$ tesseract 1327.png - --tessdata-dir ../tessdata -l eng --psm 6
1 First chapter 3
1.1 Section One 3
1.2 Section Two 3
1.3 Section Three 3
Tesseract commit # https://github.com/tesseract-ocr/tesseract/commit/a50ff5277daad5d9831f82fc07cb39ba8f1ed589
-l eng
Using traineddata files from tessdata_fast
test image attached: