tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

OSD not working again with --psm 0 after latest 20181030 binary release #2062

Closed CanadianHusky closed 4 years ago

CanadianHusky commented 5 years ago

Environment

Binary release clean install from

https://github.com/UB-Mannheim/tesseract/wiki https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.0.0.20181030.exe

Current Behavior:

Orientation is detected incorrectly for the supplied file with the command line shown.

image

WRONG Result :

Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 14.00
Script: Latin
Script confidence: nan

Expected Behavior:

compare the same input against 4.0.0-rc1 image

CORRECT Result :

Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 0.54
Script: Latin
Script confidence: 33.33

Based on tests on thousands of files, the orientation confidence value in the rc1 version is extremely accurate and makes sense. It is used as a threshold to decide whether the result can be trusted or not. The result from the 20181030 release is horribly mistaken.
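The thresholding described above can be sketched in Python. This is a minimal illustration, not part of Tesseract: the names `parse_osd` and `is_trustworthy` are hypothetical, the threshold is whatever the caller chooses, and treating a NaN confidence as untrustworthy is an assumption of this sketch. It parses the `Key: value` lines that `--psm 0` prints and accepts both the `nan` and `-nan(ind)` spellings seen in this thread.

```python
import math


def _to_float(value):
    """Convert a confidence string; 'nan' and '-nan(ind)' both become NaN."""
    if "nan" in value.lower():
        return float("nan")
    return float(value)


# Labels exactly as they appear in the `--psm 0` output quoted in this thread.
_FIELDS = {
    "Page number": ("page", int),
    "Orientation in degrees": ("orientation", int),
    "Rotate": ("rotate", int),
    "Orientation confidence": ("orientation_conf", _to_float),
    "Script": ("script", str),
    "Script confidence": ("script_conf", _to_float),
}


def parse_osd(text):
    """Parse OSD output into a dict, ignoring warnings and other lines."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        field = _FIELDS.get(key.strip())
        if sep and field:
            name, convert = field
            result[name] = convert(value.strip())
    return result


def is_trustworthy(osd, min_orientation_conf):
    """Heuristic (assumption of this sketch): reject NaN or low confidence."""
    conf = osd.get("orientation_conf", float("nan"))
    script_conf = osd.get("script_conf", float("nan"))
    if math.isnan(conf) or math.isnan(script_conf):
        return False
    return conf >= min_orientation_conf
```

A confidence threshold alone may not be enough: a wrong result reported later in this thread comes with an orientation confidence of 50.00, so it seems worth checking which traineddata file and OCR engine produced the numbers as well.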

Input Image :

image

Suggested Fix:

Investigate what led to the regression in the OSD code.

thank you kindly

stweil commented 5 years ago

This could be related to the changed handling of the alpha channel in PNG images: the latest Tesseract code replaces the alpha channel by white.

@CanadianHusky, could you please try both versions with the same image in other formats (for example JPEG or TIFF) or with a PNG without alpha channel?

CanadianHusky commented 5 years ago

Hello,

@stweil I have tested the RC3, RC4 and final 4-20181030 builds, using BMP and JPG versions of the same image. All of them suffer from the same problem and fail to detect the orientation correctly, which used to work in RC1. The problem must have been introduced somewhere between RC1 and RC3. Thank you.

CanadianHusky commented 5 years ago

Hello, I see a new pre-compiled release at https://digi.bib.uni-mannheim.de/tesseract/ for

tesseract-ocr-w64-setup-v4.1.0.20190314.exe

and tested that release against the issue mentioned above.

The result on the input image is still incorrect. I am unsure whether the binary release I used is really a 4.1.0 release or an intermediate build.

thank you

stweil commented 5 years ago

That binary is based on latest Tesseract sources (Git master).

zdenop commented 5 years ago

@CanadianHusky: you can copy and paste terminal output by selecting it with the mouse (select with the left button, then right-click in the terminal to put the selection in the clipboard). That is more useful than screenshots.

I ran a test with the latest code (5.0.0-alpha-50-g3f4dc) and the best tessdata:

> tesseract i2062.png - --dpi 175 -c min_characters_to_try=10 --psm 0 -l eng
Warning, detects only orientation with -l eng
Warning. Invalid resolution 0 dpi. Using 70 instead.
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 14.00
Script: Latin
Script confidence: -nan(ind)

But if I skip the language specification (eng should be used anyway), I get a different result:

> tesseract i2062.png - --dpi 175 -c min_characters_to_try=10 --psm 0
Warning. Invalid resolution 0 dpi. Using 70 instead.
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 0.28
Script: Greek
Script confidence: 4.36

The orientation is detected correctly, but the script is wrong. It is quite strange that specifying the eng language causes a different result...

zdenop commented 5 years ago

And using tessdata (i.e. not fast, not best) gives the correct result:

tesseract i2062.png - --psm 0 --tessdata-dir tessdata -c min_characters_to_try=10 -l eng
Warning, detects only orientation with -l eng
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 174
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 0.54
Script: Latin
Script confidence: 33.33

It seems that the LSTM model is not able to detect the orientation correctly on this kind of image (too few characters), but legacy works fine:

pi@raspberrypi:/usr/src/test $ tesseract i2062.png - --psm 0 --tessdata-dir tessdata --oem 0 --dpi 175 -c min_characters_to_try=10 -l eng
Warning, detects only orientation with -l eng
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 0.54
Script: Latin
Script confidence: 33.33
pi@raspberrypi:/usr/src/test $ tesseract i2062.png - --psm 0 --tessdata-dir tessdata --oem 1 --dpi 175 -c min_characters_to_try=10 -l eng
Warning, detects only orientation with -l eng
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 14.00
Script: Latin
Script confidence: nan
pi@raspberrypi:/usr/src/test $ tesseract i2062.png - --psm 0 --tessdata-dir tessdata --oem 2 --dpi 175 -c min_characters_to_try=10 -l eng
Warning, detects only orientation with -l eng
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 0.54
Script: Latin
Script confidence: 33.33
pi@raspberrypi:/usr/src/test $ tesseract i2062.png - --psm 0 --tessdata-dir tessdata --oem 3 --dpi 175 -c min_characters_to_try=10 -l eng
Warning, detects only orientation with -l eng
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 0.54
Script: Latin
Script confidence: 33.33

zdenop commented 5 years ago

More details that may shed some light on how it works:

If no language is specified, only osd.traineddata is used (according to the strace report). That is the reason why script detection is not correct. When a language is specified with -l eng, then:

I am not sure if we can or want to do something about this.

CanadianHusky commented 5 years ago

As soon as I see a stable binary release that I can test, I will try the suggested command line options. If using the --oem option with the correct value detects the correct orientation with a reasonable confidence value, that is sufficient. It does not matter to me personally whether the detection is done with LSTM or legacy code. Of course, it is very desirable that this sort of orientation detection works as fast as possible. I appreciate the provided information. Thank you @zdenop

zdenop commented 5 years ago

If my observation is correct, you do not need to wait for a stable release: just use the tessdata repository for OSD.

stweil commented 5 years ago

@zdenop, it is normal that only osd.traineddata is used if no explicit language was given. That file includes a selection of more than 1700 unicode characters from different scripts which are used to detect the right script. It is only available for the legacy OCR engine. Therefore it won't work if you use --oem 1 or compile Tesseract without that engine.

My tests with latest Tesseract code all give the right orientation as long as I do not add --oem 1.
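The command lines tried in this thread differ only in a few options, so comparing them is easier with a small helper. The following Python sketch only assembles the argument list; `build_osd_command` is a hypothetical name, and refusing `--oem 1` is a design choice of this sketch based on the comment above, not something Tesseract enforces. All flags used are the ones that appear in the commands quoted in this thread.

```python
def build_osd_command(image, tessdata_dir=None, lang=None, oem=None,
                      dpi=None, min_chars=None, tesseract="tesseract"):
    """Build a `--psm 0` (OSD only) command line as a list of arguments."""
    if oem == 1:
        # Per the discussion above, OSD relies on the legacy engine,
        # so this sketch rejects the LSTM-only mode up front.
        raise ValueError("OSD is reported not to work with --oem 1")
    cmd = [tesseract, image, "stdout", "--psm", "0"]
    # Compare against None so that a valid value of 0 is not dropped.
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    if min_chars is not None:
        cmd += ["-c", "min_characters_to_try=%d" % min_chars]
    if lang is not None:
        cmd += ["-l", lang]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine where Tesseract is installed, which avoids shell quoting problems with paths such as `C:\Program Files\Tesseract-OCR\tessdata`.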

zdenop commented 5 years ago

So what is the status of this issue? Can it be closed?

stweil commented 5 years ago

@CanadianHusky, do you still have that problem?

CanadianHusky commented 5 years ago

Orientation detection still has problems for me. Here are my test results, after having adjusted the command line as recommended by @stweil

Test environment : clean install from https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0.20190623.exe

image

All 3 input images are at 0 degrees but are detected with an incorrect result. I admit that the input 3 image is of poor quality, and a higher preprocessing resolution does find the correct result. However, input 2 and 4 are about as good as images are going to get, with clean and large enough letters, so I would have liked to see a correct result.

Am I still doing something wrong in the command line?

input2 image : image

input 3 image : image

input 4 image : image

Also worth noting: adding -l eng (or -l deu) changes the orientation detection result, still to an incorrect result, but with very high confidence.

image

Shreeshrii commented 5 years ago

Please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/9HSpp7Ysduw/r8FPCHhBFAAJ

It might be related to this OSD related issue.

amitdo commented 4 years ago

Reading the comments from @zdenop and @stweil, it seems that there is no regression in newer versions with the first image in this issue.

Nobody commented on the other images. It is not clear whether the OP claims that there is a regression here too, or just complains about the wrong result.

amitdo commented 4 years ago

I tested the input2 image.

I got correct result with:

tesseract input2.png input2 --psm 0 -l eng --tessdata-dir $testadadir/tessdata -c min_characters_to_try=10

console:

Warning, detects only orientation with -l eng
Tesseract Open Source OCR Engine v5.0.0-alpha-580-g87841 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 225
Warning. Invalid resolution 0 dpi. Using 70 instead.

input2.osd

Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 1.36
Script: Latin
Script confidence: 29.17

I'm not going to bother testing more images.

CanadianHusky commented 4 years ago

Thank you for revisiting this issue. In the meantime I have discovered the source of the inconsistency. The issue is not a regression in the code itself but depends on which TRAINEDDATA file is used. When I do a clean install from https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0.20190623.exe or any recent release...

This data file is installed: image

Now observe these tests, in which only the -l argument changes. The expected result is 0 degrees and a meaningful confidence value.

C:\Program Files\Tesseract-OCR>tesseract --version
tesseract v5.0.0-alpha.20191030
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX
 Found SSE
 Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5

C:\Program Files\Tesseract-OCR>tesseract --tessdata-dir "C:\Program Files\Tesseract-OCR\tessdata" --psm 0 -l eng -c min_characters_to_try=10 "input2.png" stdout
Warning, detects only orientation with -l eng
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 50.00
Script: Latin
Script confidence: 2.00

WRONG 

C:\Program Files\Tesseract-OCR>tesseract --tessdata-dir "C:\Program Files\Tesseract-OCR\tessdata" --psm 0 -l eng_15040 -c min_characters_to_try=10 "input2.png" stdout
Warning, detects only orientation with -l eng_15040
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 50.00
Script: Latin
Script confidence: 2.00

WRONG

C:\Program Files\Tesseract-OCR>tesseract --tessdata-dir "C:\Program Files\Tesseract-OCR\tessdata" --psm 0 -l eng_22917 -c min_characters_to_try=10 "input2.png" stdout
Warning, detects only orientation with -l eng_22917
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 1.38
Script: Latin
Script confidence: 30.00

CORRECT!

Here are the trained data files: image

These are the files in tessdata, and clearly the source of the issue for me is that the original file installed with the binary distribution does not give the expected result. The file eng_22917 was downloaded separately from the traineddata repository.

I would be interested to know what size your eng.traineddata file is and where it is from.

The source for my trained data files are as follows:

https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata 22917 KB, the only file that works for orientation detection, probably because it has the legacy models that the OSD code needs

https://github.com/tesseract-ocr/tessdata_fast/blob/master/eng.traineddata 4017 KB, also part of the binary installation, does not work with --psm 0 for orientation detection purposes for me

https://github.com/tesseract-ocr/tessdata_best/blob/master/eng.traineddata 15040 KB, does not work with --psm 0 for orientation detection purposes for me
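Instead of telling the files apart by size, one could check whether a traineddata file contains legacy classifier components. The Python sketch below inspects a component listing such as the one printed by `combine_tessdata -d <file>`. Two things are assumptions here: that each listing line has the form `<index>:<name>:size=..., offset=...`, and that the presence of `inttemp`, `pffmtable` and `normproto` is a usable indicator of a legacy model. The function names are hypothetical.

```python
# Component names assumed to belong to the legacy (non-LSTM) classifier.
LEGACY_COMPONENTS = {"inttemp", "pffmtable", "normproto"}


def component_names(listing):
    """Collect component names from '<index>:<name>:...' listing lines."""
    names = set()
    for line in listing.splitlines():
        parts = line.split(":")
        # Skip header lines such as the version string, which do not
        # start with a numeric index.
        if len(parts) >= 3 and parts[0].strip().isdigit():
            names.add(parts[1].strip())
    return names


def has_legacy_model(listing):
    """Return True if all assumed legacy components are listed."""
    return LEGACY_COMPONENTS <= component_names(listing)
```

If this check returns False for the file picked up by `-l eng`, that would be consistent with the wrong `--psm 0` results reported above for the fast and best models.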

It took me a very long time to understand and figure out this issue. I hope this information helps someone else. I have closed the issue.

I suppose the question now becomes whether it makes sense to add a note to the binary distribution, or elsewhere in the release notes from @stweil, that the included default traineddata file is the fast integer model, which is totally fine for most users when all they want to do is regular OCR. For anyone who is interested only in OSD, like me, the traineddata files that I linked to must be used, as far as I can see from my tests. Thanks again for having this pinned and looked into. Much appreciated.

amitdo commented 4 years ago

I would be interested to know what size your eng.traineddata file is and where it is from.

I used eng.traineddata from the tessdata repo.

https://github.com/tesseract-ocr/tessdata/blob/d87b3cbc7555/eng.traineddata

Size: 24.5 MB (24,530,234 bytes)