Totally wrong number recognition

GoogleCodeExporter commented 8 years ago

I'm working with the english.traineddata

In the following images Tesseract recognizes each glyph correctly, but there 
are severe problems in the word chopping and the number recognition.

1.)
The strangest thing is that whether the first glyph is recognized as a dollar 
sign or as a five depends on whether a comma or a dot follows the first number!

2.)
The second bug is that whether a price is split into two words or recognized 
correctly as one word also depends on whether it contains a comma or a dot. 
(See the yellow rectangles, which mark the word boundaries.)

3.)
The third bug is that the recognition of the dollar sign and the word chopping 
depend on whether the image contains one line of text or two.

4.)
Additionally it depends on the page mode.
The following images are recognized with PSM_SINGLE_BLOCK.

When I switch to PSM_AUTO both images with dot are recognized correctly and 
both images with comma are recognized wrongly.
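
For anyone who wants to reproduce the page-mode dependence outside an application, the two modes can be compared from the command line. This is only a sketch under assumptions: it assumes the `tesseract` executable is on the PATH, that the language code is `eng`, and that the build accepts `--psm` (older 3.x builds spell it `-psm`). The helper names and the image file name are made up for illustration. PSM_SINGLE_BLOCK is mode 6 and PSM_AUTO is mode 3.

```python
import shutil
import subprocess

# Numeric values of the two page segmentation modes discussed above.
PSM_AUTO = 3
PSM_SINGLE_BLOCK = 6


def build_command(image_path, psm, lang="eng", psm_flag="--psm"):
    """Return the argument list for one tesseract run that prints to stdout."""
    # Assumption: older 3.x builds need psm_flag="-psm" instead of "--psm".
    return ["tesseract", image_path, "stdout", "-l", lang, psm_flag, str(psm)]


def run_ocr(image_path, psm):
    """Run tesseract once and return its text, or None if it is not installed."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        build_command(image_path, psm),
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def compare_modes(image_path):
    """Return the OCR text of the same image under both page modes."""
    return {
        "PSM_SINGLE_BLOCK": run_ocr(image_path, PSM_SINGLE_BLOCK),
        "PSM_AUTO": run_ocr(image_path, PSM_AUTO),
    }
```

Calling `compare_modes("price_with_comma.png")` (a hypothetical file name) returns both results side by side, so a difference between the two modes is visible without any other code in between.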

All this is totally buggy. It seems that the criteria used by Tesseract are 
very weak and depend on many random factors.

My conclusion is that the glyph detection works correctly but that, in a later 
step, the interpretation of similar glyphs (e.g. "5" versus "$") and the word 
chopping are failing completely.

I observe this problem very frequently.

I uploaded the original images in a ZIP so you can verify it.

Original issue reported on code.google.com by smaragds...@gmail.com on 21 Sep 2014 at 5:04

Attachments:

GoogleCodeExporter commented 8 years ago

Here is another example of a totally wrong word separation:

Original comment by smaragds...@gmail.com on 23 Sep 2014 at 7:01

Attachments:

TotallyBuggy.png

GoogleCodeExporter commented 8 years ago

... and then, with a slightly modified preprocessing of the image, the same 
text is suddenly recognized correctly. If you compare the two images you see 
that there are some additional random pixels. This is the only difference 
between them.
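
To check this kind of claim on other image pairs, the differing pixels can be listed directly. A minimal sketch, assuming both images have already been decoded into rows of pixel values (decoding the PNG files needs an image library that is not shown here); the function name is hypothetical.

```python
def differing_pixels(image_a, image_b):
    """Return (x, y) coordinates where two equally sized images differ.

    Each image is a list of rows, each row a list of pixel values.
    """
    if len(image_a) != len(image_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)
    ):
        raise ValueError("images must have the same dimensions")
    return [
        (x, y)
        for y, (row_a, row_b) in enumerate(zip(image_a, image_b))
        for x, (pixel_a, pixel_b) in enumerate(zip(row_a, row_b))
        if pixel_a != pixel_b
    ]
```

A short list of isolated coordinates would support the observation that only a few stray pixels separate the failing image from the working one.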

This confirms what I wrote above:
It seems that the criteria used by Tesseract are very weak and depend on many 
random factors.

There is definitely no space between "J" and "eff" and "rey".
So why does Tesseract split the text that is clearly one single word?

Original comment by smaragds...@gmail.com on 23 Sep 2014 at 3:21

Attachments:

TotallyCorrect.png

GoogleCodeExporter commented 8 years ago

Here is another weird example.

One of the images has the text a little bit bolder than the other one. But the 
result of the bold text is that Tesseract does not recognize ANY character.

I would understand if Tesseract recognized the "g" wrongly as an "8". 
But why does it not recognize ANY of the other characters?

Original comment by smaragds...@gmail.com on 23 Sep 2014 at 9:13

Attachments:

GoogleCodeExporter commented 8 years ago

And here is another related bug.
The image below has been analyzed in PSM_AUTO mode.

The original image has two horizontal rulers which are detected correctly.
But the further processing ignores them totally.

Between "Coffee" and "Subtotal" is a ruler and a new paragraph should start.
The same applies between "Tax" and "Total".

But the first horizontal ruler is ignored.
Instead, a new paragraph starts after the first text line, between "Chicken" 
and "Chips". This is totally wrong. The first 4 lines have exactly the same 
line spacing. They should be in the same paragraph.

See image: (blue = text block, green = paragraph, red = text line, yellow = 
word)
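
The same four levels can also be inspected as numbers instead of drawn rectangles. A sketch, assuming a Tesseract build new enough to offer TSV output (the version discussed in this report may not have it), in which the first column is the layout level: 2 = block, 3 = paragraph, 4 = text line, 5 = word. The function name is hypothetical, and the sample rows are invented for illustration; they are not output from the attached image.

```python
import csv
import io

# Level numbers in the TSV output, mapped to the colours used in the image above.
LEVELS = {
    2: ("block", "blue"),
    3: ("paragraph", "green"),
    4: ("line", "red"),
    5: ("word", "yellow"),
}

# Invented sample rows, for illustration only.
SAMPLE_TSV = "\n".join([
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext",
    "1\t1\t0\t0\t0\t0\t0\t0\t200\t60\t-1\t",
    "2\t1\t1\t0\t0\t0\t10\t10\t180\t40\t-1\t",
    "3\t1\t1\t1\t0\t0\t10\t10\t180\t40\t-1\t",
    "4\t1\t1\t1\t1\t0\t10\t10\t180\t18\t-1\t",
    "5\t1\t1\t1\t1\t1\t10\t10\t70\t18\t91\tCoffee",
    "5\t1\t1\t1\t1\t2\t150\t10\t40\t18\t88\t$2.50",
])


def boxes_by_level(tsv_text):
    """Group bounding boxes (left, top, width, height) by layout level name."""
    grouped = {name: [] for name, _colour in LEVELS.values()}
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        level = int(row["level"])
        if level in LEVELS:  # skip the page-level row (level 1)
            name, _colour = LEVELS[level]
            box = tuple(int(row[key]) for key in ("left", "top", "width", "height"))
            grouped[name].append(box)
    return grouped
```

Counting `len(boxes_by_level(tsv_text)["paragraph"])` on real output would show how many paragraphs were produced, and the `top` values of the paragraph boxes would show where each one begins relative to the horizontal rulers.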

Original comment by smaragds...@gmail.com on 25 Sep 2014 at 7:05

Attachments:

BuggyAutoMode.png

GoogleCodeExporter commented 8 years ago

This bug has been in state "New" for 6 months.

I have the impression that posting bug reports here is completely in vain.

Original comment by smaragds...@gmail.com on 21 Mar 2015 at 12:53

justaddcoffee / tesseract-ocr

Totally wrong number recognition #1322