Superscript Line Problem

raffaeldantas / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Superscript Line Problem #1411

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. OCR the attached image in PSM_SINGLE_BLOCK mode using the latest tesseract 
3.04 with the default eng.traineddata

What is the expected output? What do you see instead?
It outputs "2\nx" instead of "x2"

What version of the product are you using? On what operating system?
Latest 3.04 on Windows 8.1 Pro

Original issue reported on code.google.com by Jimma...@gmail.com on 29 Jan 2015 at 6:52

Attachments:

superscript.jpg

GoogleCodeExporter commented 9 years ago

Here are my tests (on Linux) with the latest code:
- psm 0 to 4 produce no output
- psm 5 output:
2
x
- psm 6 output:
2
x
- psm 7 output:
x2
- psm 8 output:
x2
- psm 9 output:
x2
- psm 10 output:
ﬁ

I ran it from the command line like this: 'tesseract superscript.jpg - -psm 8'
I also tried it on Windows 7 Pro and got the same results, so I cannot 
reproduce the problem.
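
The per-mode test above can be scripted. This is a minimal sketch, assuming the
tesseract binary is on PATH and the attachment is saved as superscript.jpg in
the working directory; the helper names build_command and run_all_modes are
hypothetical. The flag is spelled '-psm' as in this thread; tesseract 4.x and
later spell it '--psm', so the flag is a parameter.

```python
import subprocess


def build_command(image, psm, flag="-psm"):
    """Return the argv list for one tesseract run that prints to stdout."""
    # "-" as the output base makes tesseract write the recognized text to stdout.
    return ["tesseract", image, "-", flag, str(psm)]


def run_all_modes(image, modes=range(0, 11), flag="-psm"):
    """Run tesseract once per page segmentation mode and collect the text.

    Hypothetical helper: it needs tesseract installed and the image present.
    """
    results = {}
    for psm in modes:
        proc = subprocess.run(
            build_command(image, psm, flag),
            capture_output=True,
            text=True,
        )
        results[psm] = proc.stdout.strip()
    return results
```

Calling run_all_modes("superscript.jpg") returns a dict keyed by psm value, 
which makes it easy to compare modes 5 and 6 against 7 to 9 side by side.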

Original comment by zde...@gmail.com on 17 Apr 2015 at 7:55

Changed state: WorksForMe

GoogleCodeExporter commented 9 years ago

PSM_SINGLE_BLOCK (psm 6) is the problem. psm 7 and higher force horizontal text 
no matter what, but they are not applicable when scanning pages.

Original comment by Jimma...@gmail.com on 17 Apr 2015 at 9:50

GoogleCodeExporter commented 9 years ago

superscript.jpg is neither a page nor a block (paragraph)! 
If you instruct tesseract to analyze this image as several lines of text, that 
is your request and not a tesseract failure.

Original comment by zde...@gmail.com on 18 Apr 2015 at 5:11

GoogleCodeExporter commented 9 years ago

A better example is attached. Using the command "tesseract superscript.jpg - 
-psm 6", it outputs:
"
aaaaa
2
ax
CCCCCC
"

Original comment by Jimma...@gmail.com on 18 Apr 2015 at 5:37

Attachments:

superscript.png

GoogleCodeExporter commented 9 years ago

I do not think the superscript is typeset correctly. You place the 2 above the 
x-height, which is IMO wrong. If you do line segmentation, it could be placed on 
a separate line.
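
To illustrate why a raised superscript can end up on its own line, here is a toy
sketch of row-based line segmentation using a horizontal projection profile.
This is a generic textbook technique chosen for illustration, not a description
of tesseract's actual line finder; the function name split_into_lines and the
tiny "#" bitmaps are made up for the example.

```python
def split_into_lines(rows):
    """Group consecutive pixel rows that contain ink into (top, bottom) bands."""
    bands, start = [], None
    for y, row in enumerate(rows):
        has_ink = "#" in row
        if has_ink and start is None:
            start = y                      # a new band begins
        elif not has_ink and start is not None:
            bands.append((start, y - 1))   # a blank row closes the band
            start = None
    if start is not None:
        bands.append((start, len(rows) - 1))
    return bands


# Superscript raised clear of the base letter: a blank row separates them.
raised = [
    "....##",
    "....##",
    "......",
    "##....",
    "##....",
]

# Superscript overlapping the base letter's rows: no blank row in between.
overlapping = [
    "....##",
    "##..##",
    "##....",
]

print(split_into_lines(raised))       # two bands
print(split_into_lines(overlapping))  # one band
```

In this toy model the raised glyph becomes a separate band, while the 
overlapping glyph stays in the same band as the base letter.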

Have a look, e.g., at Wikipedia for how superscript should be typeset: 
http://en.wikipedia.org/wiki/Subscript_and_superscript

If I correct the typesetting (see attachment), I get the correct result:

aaaaa
ax2
cccccc

Original comment by zde...@gmail.com on 18 Apr 2015 at 7:16

Attachments:

superscript3.png

GoogleCodeExporter commented 9 years ago

Thanks for looking into this. Many fonts place the superscript 1-2px above 
lowercase letters, as there is no commonly followed standard for how high it 
should sit. Is there an option to modify the superscript detection parameters 
in tesseract?

Original comment by Jimma...@gmail.com on 19 Apr 2015 at 5:21

GoogleCodeExporter commented 9 years ago

I remember that somebody on the tesseract forum had a problem where some 
diacritical marks (usually placed above a letter, e.g. á) were put on a 
separate line by tesseract. The solution was to modify some parameter, but 
unfortunately I cannot find that conversation.
I will try to have a look at this later, so I am changing the status of the 
issue back to open...

Original comment by zde...@gmail.com on 19 Apr 2015 at 8:36

Changed state: New

GoogleCodeExporter commented 9 years ago

I tried searching and found this: 
https://code.google.com/p/tesseract-ocr/issues/detail?id=877 . 
textord_min_linesize seems to work, but it messes up the letter "a" even though 
it is a perfectly clear character. Any reason why?

Using the command "tesseract superscript.png - -psm 6 config.txt" with 
config.txt containing "textord_min_linesize 2", it outputs:

"
aaaaa
3X2
CCCCCC
"

Original comment by Jimma...@gmail.com on 20 Apr 2015 at 12:23
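
The config-file invocation quoted in the last comment can be sketched as
follows. This is a minimal sketch, assuming a config file holds one
"name value" pair per line as quoted above; the helper names render_config,
write_config and build_command_with_config are hypothetical, and the value 2 is
simply the one tried in the thread, not a recommended setting.

```python
from pathlib import Path


def render_config(variables):
    """Render variables as one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in variables.items())


def write_config(path, variables):
    """Write the rendered config to disk and return the path."""
    Path(path).write_text(render_config(variables))
    return path


def build_command_with_config(image, config_path, psm=6, flag="-psm"):
    # Mirrors the thread's command: tesseract superscript.png - -psm 6 config.txt
    # The config file name is passed last, as in the thread.
    return ["tesseract", image, "-", flag, str(psm), str(config_path)]
```

Keeping the variables in a dict makes it easy to try several values of 
textord_min_linesize and compare how the "a" in "ax" is recognized each time.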