tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

OSD not working with TIFF image containing only a block of text #2615

Open lsabouri74 opened 5 years ago

lsabouri74 commented 5 years ago

Environment

Current Behavior:

I am calling tesseract using the following command line: tesseract.exe ".\Failfile.tiff" ".\out" --tessdata-dir ".\tessdata" -l eng --psm 1 --oem 1

All pages have the 0 orientation. In the output, I get gibberish for the first page and correct output for the 3 subsequent pages. If I use --psm 3 to disable OSD, I get the correct output. Option --psm 0 gives me the following output:

Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 250.00
Script: Latin
Script confidence: -nan(ind)
Page number: 1
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 250.00
Script: Latin
Script confidence: -nan(ind)
Page number: 2
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 250.00
Script: Latin
Script confidence: -nan(ind)
Page number: 3
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 250.00
Script: Latin
Script confidence: -nan(ind)

It seems like even if detection is also wrong for pages 2, 3 and 4, the output is ok, which is inconsistent... Here is the file to reproduce: Failfile.zip

NOTE: I tested with both tessdata_fast and tessdata_best with the same result.

Expected Behavior:

  1. I would expect OSD to work on a simple, clean document containing only a block of text
  2. I would expect OSD behavior to be consistent on all pages

stweil commented 5 years ago

I cannot reproduce all of your results:

$ tesseract Failfile.tiff - --psm 0
Page 1
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 4.77
Script: Cyrillic
Script confidence: 0.46
Page 2
Page number: 1
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 1.42
Script: Cyrillic
Script confidence: 1.52
Page 3
Page number: 2
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 15.43
Script: Latin
Script confidence: 7.78
Page 4
Page number: 3
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 17.69
Script: Latin
Script confidence: 22.50

I can reproduce the bad result for page 1, so that looks like a bug.

lsabouri74 commented 5 years ago

I have the same result as you if I don't specify any languages... Is that normal? I guess selecting the right language has some effect on OSD... See below command and result when using -l eng:

tesseract.exe Failfile.tiff - --psm 0 -l eng
Warning, detects only orientation with -l eng
Page 1
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 250.00
Script: Latin
Script confidence: -nan(ind)
Page 2
Page number: 1
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 250.00
Script: Latin
Script confidence: -nan(ind)
Page 3
Page number: 2
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 250.00
Script: Latin
Script confidence: -nan(ind)
Page 4
Page number: 3
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 250.00
Script: Latin
Script confidence: -nan(ind)

stweil commented 5 years ago

Even with tesseract Failfile.tiff - --psm 0 -l eng I always get the right orientation.

amitdo commented 4 years ago

Don't use --psm 0 -l <lang> with the tessdata_best or tessdata_fast repos. OSD uses the legacy engine and needs a legacy model. None of the models in these repos contain legacy data. The same rule applies to --psm 1 -l <lang> and --psm 12 -l <lang>.

The only exception to this rule is osd.traineddata, which is a legacy model.

That's why it's okay to do:

tesseract in.png out --psm 0 which is equivalent to: tesseract in.png out --psm 0 -l osd

with models from best/fast repos.
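To make the rule above concrete, here is a minimal Python sketch that runs OSD with only the osd model and parses the per-page output shown earlier in this thread. The function names `parse_osd` and `run_osd` are made up for illustration, and the sketch assumes that `tesseract` is on the PATH and that the OSD report goes to stdout when the output base is `-`, as in the commands quoted above. Values are kept as strings because a confidence can come back as `-nan(ind)`.

```python
import subprocess


def parse_osd(text):
    """Parse `tesseract --psm 0` output into one dict per page."""
    pages = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # lines without a colon, e.g. "Page 1" or warnings
        key, value = key.strip(), value.strip()
        if key == "Page number":
            pages.append({})  # each page report starts with this key
        if pages:
            pages[-1][key] = value
    return pages


def run_osd(image_path, tessdata_dir=None):
    """Run OSD using only osd.traineddata (hypothetical helper)."""
    cmd = ["tesseract", image_path, "-", "--psm", "0", "-l", "osd"]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_osd(result.stdout)
```

A caller could then read `run_osd("Failfile.tiff")[0]["Rotate"]` and decide whether to rotate the page before running recognition with `--psm 3 -l eng`.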

amitdo commented 4 years ago

Even with tesseract Failfile.tiff - --psm 0 -l eng I always get the right orientation.

With an eng model from the tessdata repo, right?

amitdo commented 4 years ago

@stweil,

$ tesseract Failfile.tiff - --psm 0
Page 1
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 4.77
Script: Cyrillic
Script confidence: 0.46
Page 2
Page number: 1
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 1.42
Script: Cyrillic
Script confidence: 1.52

I can reproduce the bad result for page 1, so that looks like a bug.

Here, Tesseract detects the script in the first two pages as Cyrillic instead of Latin.

The osd traineddata can identify only a limited number of characters for each script. The first two pages contain mostly uppercase letters in the Latin script, which I guess are misrepresented in the osd model. Cyrillic has some smallcaps letters that look like uppercase Latin letters.

lsabouri74 commented 4 years ago

Don't use --psm 0 -l <lang> with the tessdata_best or tessdata_fast repos. OSD uses the legacy engine and needs a legacy model. None of the models in these repos contain legacy data. The same rule applies to --psm 1 -l <lang> and --psm 12 -l <lang>.

The only exception to this rule is osd.traineddata, which is a legacy model.

That's why it's okay to do:

tesseract in.png out --psm 0 which is equivalent to: tesseract in.png out --psm 0 -l osd

with models from best/fast repos.

Does it mean that if I am using best/fast repos, I should not use: tesseract in.png out --psm 1 -l eng

Would you recommend using the default --psm 3 instead? Will it still do page orientation detection?

amitdo commented 4 years ago

Does it mean that if I am using best/fast repos, I should not use: tesseract in.png out --psm 1 -l eng

Don't use this command with best/fast data.

Would you recommend using the default --psm 3 instead?

You can use it if you know that the image does not need to be rotated.

Will it still do page orientation detection?

No.

tesseract --help-psm

3 Fully automatic page segmentation, but no OSD. (Default)

FarhadKhalafi commented 4 years ago

In the fully automatic mode, Tesseract will often detect the orientation of segments, but not always reliably. If you need to determine the dominant orientation of an image (a page can have both horizontal and vertical blocks), one approach is to check the OCR results for each orientation and pick the one with the best results. To save time, you may compute a score for each run using a heuristic model and a feature list. The feature list could include the lengths and counts of words that appear in a dictionary, words that can be parsed as legitimate numbers, dates, etc. Some domain-specific knowledge can also be worked into the heuristic.

Suppose you have such a function that computes a number between 0 and 100 for each run from the OCR results of that run. You pick a threshold, say 75, above which the orientation is acceptable. You run the OCR without any rotation and compute the score from the OCR results. If the score exceeds the threshold, then you stop and report the detected orientation (don't forget to add your rotation angle to the orientation that tesseract reports at page block level).

If not, then rotate the image by 90 degrees (using leptonica, for instance) and pass the rotated image to tesseract again and repeat the computation as for the first step.

If none of the 4 orientations results in a high enough score, then you either give up or pick the orientation with the highest score.
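The loop described above could be sketched in Python roughly as follows. This is only an illustration, not the code used in the project: `pick_rotation`, `dictionary_score`, and the `ocr_at` callback are hypothetical names, and the scoring heuristic here is a bare dictionary hit rate rather than the richer feature list mentioned above. The caller supplies `ocr_at(angle)`, which is expected to rotate the image by `angle` degrees, run tesseract on it, and return the recognized text.

```python
def dictionary_score(text, dictionary):
    """Score 0-100: percentage of tokens that appear in `dictionary`."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in dictionary)
    return 100.0 * hits / len(words)


def pick_rotation(ocr_at, score, threshold=75.0):
    """Try 0, 90, 180, 270 degrees and return (angle, score).

    Stops at the first rotation whose score exceeds `threshold`;
    otherwise falls back to the best-scoring rotation.
    """
    best = (0, -1.0)
    for angle in (0, 90, 180, 270):
        s = score(ocr_at(angle))
        if s > threshold:
            return angle, s  # good enough, skip the remaining runs
        if s > best[1]:
            best = (angle, s)
    return best
```

For example, `pick_rotation(ocr_at, lambda t: dictionary_score(t, english_words))` returns the applied rotation and its score, where `english_words` stands for whatever word set fits the documents. Stopping early keeps the common case (an upright page) down to a single OCR run.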

We use this algorithm to detect and rotate scanned pages in a project. It works really well except for bad-quality dark images (common in OCR of old documents). We have already processed in excess of 3 million pages using this method. The algorithm sometimes breaks down and gives wrong values, but we knew that going in, and the error rate is small enough to be acceptable to our client. It certainly eliminates a lot of manual sorting of paper.
