Open lsabouri74 opened 5 years ago
I cannot reproduce all of your results:
$ tesseract Failfile.tiff - --psm 0
Page 1
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 4.77
Script: Cyrillic
Script confidence: 0.46
Page 2
Page number: 1
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 1.42
Script: Cyrillic
Script confidence: 1.52
Page 3
Page number: 2
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 15.43
Script: Latin
Script confidence: 7.78
Page 4
Page number: 3
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 17.69
Script: Latin
Script confidence: 22.50
I can reproduce the bad result for page 1, so that looks like a bug.
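For anyone comparing these per-page results in a script rather than by eye, here is a minimal sketch of a parser for the `--psm 0` text output shown above. The function name `parse_osd_output` and the dictionary keys are my own (hypothetical, not part of Tesseract), and the sketch assumes the output keeps the `Key: value` layout seen in this thread. Non-numeric confidences such as `-nan(ind)` are mapped to NaN.

```python
import math


def _to_value(value):
    """Convert a field to int or float where possible, else keep the string."""
    if value.lower().lstrip("+-").startswith("nan"):
        return math.nan  # e.g. "-nan(ind)" printed by the Windows build
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse_osd_output(text):
    """Parse `tesseract ... --psm 0` text output into one dict per page."""
    pages = []
    current = None
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # "Page 1", warnings and the command line have no colon
        key, value = key.strip(), value.strip()
        if key == "Page number":
            current = {"page_number": int(value)}
            pages.append(current)
        elif current is not None:
            current[key.lower().replace(" ", "_")] = _to_value(value)
    return pages
```

With the parsed list you can, for example, flag every page whose `script` is not the one you expect before trusting its `orientation_in_degrees`.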
I have the same result as you if I don't specify any languages... Is that normal? I guess selecting the right language has some effect on OSD... See below command and result when using -l eng:
tesseract.exe Failfile.tiff - --psm 0 -l eng
Warning, detects only orientation with -l eng
Page 1
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 250.00
Script: Latin
Script confidence: -nan(ind)
Page 2
Page number: 1
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 250.00
Script: Latin
Script confidence: -nan(ind)
Page 3
Page number: 2
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 250.00
Script: Latin
Script confidence: -nan(ind)
Page 4
Page number: 3
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 250.00
Script: Latin
Script confidence: -nan(ind)
Even with tesseract Failfile.tiff - --psm 0 -l eng
I always get the right orientation.
Don't use --psm 0 -l <lang> with the tessdata_best or tessdata_fast repos. OSD uses the legacy engine and needs a legacy model. None of the models in these repos contain legacy data. The same rule applies to --psm 1 -l <lang> and --psm 12 -l <lang>.
The only exception to this rule is osd.traineddata, which is a legacy model.
That's why it's okay to do:
tesseract in.png out --psm 0
which is equivalent to:
tesseract in.png out --psm 0 -l osd
with models from best/fast repos.
Even with
tesseract Failfile.tiff - --psm 0 -l eng
I always get the right orientation.
With an eng model from the tessdata repo, right?
@stweil,
$ tesseract Failfile.tiff - --psm 0
Page 1
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 4.77
Script: Cyrillic
Script confidence: 0.46
Page 2
Page number: 1
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 1.42
Script: Cyrillic
Script confidence: 1.52
I can reproduce the bad result for page 1, so that looks like a bug.
Here, Tesseract detects the script in the first two pages as Cyrillic instead of Latin.
The osd traineddata can identify only a limited number of characters for each script. The first two pages contain mostly uppercase letters in the Latin script, which I guess are misrepresented in the osd model. Cyrillic has some small caps letters that look like uppercase Latin letters.
Don't use --psm 0 -l <lang> with the tessdata_best or tessdata_fast repos. OSD uses the legacy engine and needs a legacy model. None of the models in these repos contain legacy data. The same rule applies to --psm 1 -l <lang> and --psm 12 -l <lang>.
The only exception to this rule is osd.traineddata, which is a legacy model.
That's why it's okay to do:
tesseract in.png out --psm 0
which is equivalent to:
tesseract in.png out --psm 0 -l osd
with models from best/fast repos.
Does it mean that if I am using best/fast repos, I should not use:
tesseract in.png out --psm 1 -l eng
Would you recommend using the default --psm 3 instead? Will it still do page orientation detection?
Does it mean that if I am using best/fast repos, I should not use: tesseract in.png out --psm 1 -l eng
Don't use this command with best/fast data.
Would you recommend using the default --psm 3 instead?
You can use it if you know that the image does not need to be rotated.
Will it still do page orientation detection?
No.
tesseract --help-psm
3 Fully automatic page segmentation, but no OSD. (Default)
In the fully automatic mode, Tesseract will often detect the orientation of segments, but not always reliably. If you need to determine the dominant orientation of an image (a page can have both horizontal and vertical blocks), one approach is to run OCR for each orientation and pick the one with the best results. To save time, you can compute a score for each run using a heuristic model and a feature list. The feature list could include the lengths and counts of words that appear in a dictionary, words that can be parsed as legitimate numbers, dates, etc. Some domain-specific knowledge can also be worked into the heuristic.
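As a rough illustration of such a heuristic (not the model used in the project described in this comment), the sketch below scores OCR text by the share of tokens that are either dictionary words or look like numbers or dates. The name `score_ocr_text`, the `dictionary` argument, and the number/date pattern are all assumptions made for the example.

```python
import re

# Digits optionally joined by common separators, e.g. 12.50, 19/05/2020, 14:26
_NUMBER_OR_DATE = re.compile(r"^\d+([.,/:-]\d+)*$")


def score_ocr_text(text, dictionary, min_word_len=2):
    """Return a 0-100 score: percentage of tokens that look plausible.

    `dictionary` is any container of lowercase words (a set works well).
    """
    tokens = [t.strip(".,;:!?()[]\"'") for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0  # empty output is treated as the worst case
    plausible = 0
    for tok in tokens:
        if _NUMBER_OR_DATE.match(tok):
            plausible += 1
        elif len(tok) >= min_word_len and tok.lower() in dictionary:
            plausible += 1
    return 100.0 * plausible / len(tokens)
```

A real feature list would likely weight longer words more heavily and add domain-specific terms, as suggested above; this version only counts matches.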
Suppose you have such a function that computes a number between 0 and 100 for each run from the OCR results of that run. You pick a threshold, say 75, above which the orientation is acceptable. You run the OCR without any rotation and compute the score from the OCR results. If the score exceeds the threshold, you stop and report the detected orientation (don't forget to add your rotation angle to the orientation that tesseract reports at page block level).
If not, rotate the image by 90 degrees (using leptonica, for instance), pass the rotated image to tesseract again, and repeat the computation as in the first step.
If none of the 4 orientations results in a high enough score, you either give up or pick the orientation with the highest score.
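The steps above can be sketched roughly as follows. This is only an outline under stated assumptions: `find_orientation` is a hypothetical name, and `run_ocr`, `rotate` and `score_fn` are callables you supply yourself (OCR an image to text, rotate an image by a given angle, and score text from 0 to 100). It does not add in the block-level orientation that tesseract itself reports.

```python
def find_orientation(image, run_ocr, rotate, score_fn, threshold=75.0):
    """Try 0/90/180/270 degree rotations and return (angle, score, accepted).

    Stops at the first rotation whose score exceeds `threshold`.
    If none does, returns the best-scoring rotation with accepted=False.
    """
    best_angle, best_score = 0, float("-inf")
    for angle in (0, 90, 180, 270):
        # Always rotate from the original image to avoid compounding errors.
        candidate = image if angle == 0 else rotate(image, angle)
        score = score_fn(run_ocr(candidate))
        if score > threshold:
            return angle, score, True
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle, best_score, False
```

In practice the callables could be backed by leptonica as mentioned above, or in Python by something like `pytesseract.image_to_string` and Pillow's `Image.rotate(angle, expand=True)`. The caller decides whether to give up or use the best guess when `accepted` is false.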
We use this algorithm to detect and rotate scanned pages in a project. It works really well except for bad-quality dark images (common in OCR of old documents). We have already processed in excess of 3 million pages using this method. The algorithm sometimes breaks down and gives wrong values, but we knew that going in, and the error rate is small enough to be acceptable to our client. It certainly eliminates a lot of manual sorting of paper.
Environment
Current Behavior:
I am calling tesseract using the following command line:
tesseract.exe ".\Failfile.tiff" ".\out" --tessdata-dir ".\tessdata" -l eng --psm 1 --oem 1
All pages have the 0 orientation. In the output, I get gibberish for the first page and correct output for the 3 subsequent pages. If I use --psm 3 to disable OSD, I get the correct output. Option --psm 0 gives me the following output:
It seems like even if detection is also wrong for pages 2, 3 and 4, the output is ok, which is inconsistent... Here is the file to reproduce: Failfile.zip
NOTE: I tested with both tessdata_fast and tessdata_best with the same result.
Expected Behavior: