raffaeldantas / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Wrong page segmentation and chars recognition #1429

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
NOTE: TEST2.tif was obtained from TEST1.tif by adding some white space on the left.

What steps will reproduce the problem?
1. tesseract.exe TEST1.tif test1 -l ita
2. tesseract.exe TEST2.tif test2 -l ita

What is the expected output? What do you see instead?
I expect to get the same result, because I only changed the page dimensions. 
Instead, I get a very different result.
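
To make the difference easy to see, the two commands above could be scripted and their text outputs diffed. This is only a rough sketch: the helper names (`tesseract_command`, `run_ocr`, `diff_texts`, `compare`) are made up for illustration, it assumes the `tesseract` executable is on the PATH, and it assumes the output lands in `<outputbase>.txt`.

```python
import difflib
import subprocess
from pathlib import Path


def tesseract_command(image, output_base, lang="ita"):
    """Build the same command line as in the steps above."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def run_ocr(image, output_base, lang="ita"):
    """Run tesseract and return the text written to <output_base>.txt."""
    subprocess.run(tesseract_command(image, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


def diff_texts(text_a, text_b, name_a="test1.txt", name_b="test2.txt"):
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(
        difflib.unified_diff(
            text_a.splitlines(),
            text_b.splitlines(),
            fromfile=name_a,
            tofile=name_b,
            lineterm="",
        )
    )


def compare(image_a="TEST1.tif", image_b="TEST2.tif"):
    """OCR both images and print every line that differs between them."""
    first = run_ocr(image_a, "test1")
    second = run_ocr(image_b, "test2")
    for line in diff_texts(first, second):
        print(line)
```

Calling `compare()` in the folder that holds both images should print nothing if the two results really are the same, and a list of changed lines otherwise.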

What version of the product are you using? On what operating system?
tesseract 3.02
 leptonica-1.68 (Mar 14 2011, 10:43:03) [MSC v.1500 LIB Release 32 bit]
  libgif 4.1.6 : libjpeg 8c : libpng 1.4.3 : libtiff 3.9.4 : zlib 1.2.5
Windows 7

Please provide any additional information below.
I ran tesseract with debug output, and it seems that tesseract cannot get the 
character bounding boxes.
Enclosed are a couple of screenshots where you can see the character detection 
in test1 and in test2.
Are there any configuration flags that I can set to fix this?
In test1 you can see that I also have segmentation problems: tesseract wrongly 
splits some lines of text (e.g. BOTTINELLI -> BOT  TINELLI and 
DESTINATARIO -> DES  TINATARIO) because of wrong page segmentation. 
I've tried other psm values, but none works better than the default.
Again: are there any configuration flags that I can set to fix this?
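
For reference, trying the other psm values can be automated along these lines. This is a sketch under assumptions: the function names are hypothetical, the `-psm N` spelling and the 0 to 10 range are what I understand tesseract 3.x to accept (newer releases use `--psm`), and some modes may fail or produce no text on a given install, so failed runs are simply skipped.

```python
import subprocess
from pathlib import Path

# Page segmentation modes to try. Mode 0 is left out because it only
# detects orientation and script and does no recognition (assumption
# based on the tesseract 3.x usage text).
PSM_MODES = range(1, 11)


def psm_command(image, output_base, psm, lang="ita"):
    """Build a tesseract 3.x command line for one page segmentation mode."""
    return ["tesseract", str(image), str(output_base), "-l", lang, "-psm", str(psm)]


def sweep_psm(image, lang="ita", modes=PSM_MODES):
    """Run every mode and return {psm: recognized text} for the runs that worked."""
    results = {}
    for psm in modes:
        base = f"{Path(image).stem}_psm{psm}"
        completed = subprocess.run(psm_command(image, base, psm, lang))
        out_file = Path(base + ".txt")
        # Skip modes that error out or write no text file.
        if completed.returncode == 0 and out_file.exists():
            results[psm] = out_file.read_text(encoding="utf-8")
    return results
```

Running `sweep_psm("TEST1.tif")` and searching each result for words like BOTTINELLI would show quickly whether any mode keeps them in one piece.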

My big concern is that just adding some white space to my source image gives a 
very different result.
This confuses me! I assumed that I could remove borders to reduce the image 
dimensions.

Original issue reported on code.google.com by stefano....@gmail.com on 3 Mar 2015 at 8:11

Attachments: