raffaeldantas / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

ocr of real (old) printing, but dirty #1455

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
I work with tesseract 3.02.02 on SUSE Linux 13.2

The text to be OCR'd is real printed text from about 1930.
The printing is a little dirty, i.e. there are small dots and strokes between 
the letters.
Although these are far smaller than the letters, they are interpreted as 
normal letters.

The normal letters are recognized fairly well.

As an example:
the attached picture is translated to the text
  15 Ellser Exdmsund Mögsgzerg

Is there a way to give tesseract parameters so that it 
. either ignores letters which do not fit the majority of the other 
  letters, 
. or only uses letters within a given size range,
. or first makes the boxes, 
  which are then corrected by hand or by program, 
  and finally translates using the corrected boxes?
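
For the first two options, one approach that works outside of tesseract is to clean the image before OCR by deleting connected groups of ink pixels that are much smaller than a letter. The following is only a minimal sketch of that idea, not a tesseract feature: the function name `remove_small_components` and the `min_pixels` threshold are made up for illustration, and the image is assumed to be already binarized into rows of 0 (paper) and 1 (ink).

```python
from collections import deque


def remove_small_components(image, min_pixels):
    """Return a copy of a binary image with small ink blobs cleared.

    image      -- list of rows, each a list of 0 (paper) or 1 (ink)
    min_pixels -- hypothetical threshold: blobs with fewer ink pixels
                  than this are treated as dirt and erased
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one 8-connected blob of ink.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Erase the blob if it is too small to be a letter.
            if len(component) < min_pixels:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result


if __name__ == "__main__":
    page = [
        [1, 1, 0, 0, 0],
        [1, 1, 0, 0, 1],  # the lone pixel on the right stands in for dirt
        [1, 1, 0, 0, 0],
    ]
    for row in remove_small_components(page, min_pixels=3):
        print(row)
```

A threshold like this would also erase genuine small marks such as periods, commas and the dots of "i" or "ö", so it would have to be chosen with the actual scan in mind.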

I have already tried a config file containing:
  textord_min_xheight 26
  textord_xheight_mode_fraction 0.9
  textord_xheight_error_margin 0.1
  textord_descx_ratio_min 0.3
  textord_descx_ratio_max 0.6
  textord_ascx_ratio_min 1.3
  textord_ascx_ratio_max 1.7
  load_system_dawg F
  load_freq_dawg F
It changes some things, but nothing makes it ignore the dots and strokes.

I also tried to make the boxes, correct them by deleting the false letters,
and then translate with these boxes using a config file containing:
  tessedit_make_boxes_from_boxes T
but this doesn't do what I want.
Is there a way to accomplish this?
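
The "correct the boxes by program" step could look roughly like the sketch below. It assumes the usual Tesseract 3.x box-file layout of one symbol per line in the form `symbol left bottom right top page`, with coordinates measured from the bottom-left corner of the image. The helper names and the `min_height` / `min_width` thresholds are hypothetical, and the sketch only filters the box file; it does not make tesseract use the filtered boxes for recognition.

```python
def parse_box_line(line):
    """Split one box-file line into (symbol, left, bottom, right, top, page)."""
    parts = line.split()
    if len(parts) < 5:
        raise ValueError("not a box-file line: %r" % line)
    symbol = parts[0]
    left, bottom, right, top = (int(value) for value in parts[1:5])
    # The page field is assumed to be optional and to default to 0.
    page = int(parts[5]) if len(parts) > 5 else 0
    return symbol, left, bottom, right, top, page


def filter_boxes(lines, min_height, min_width=1):
    """Keep only the box lines whose box is at least the given size.

    min_height and min_width are illustrative thresholds in pixels;
    boxes smaller than either one are assumed to be dirt.
    """
    kept = []
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        symbol, left, bottom, right, top, page = parse_box_line(line)
        if top - bottom >= min_height and right - left >= min_width:
            kept.append(line.rstrip("\n"))
    return kept


if __name__ == "__main__":
    sample = [
        "E 10 20 30 60 0",
        ". 32 22 35 25 0",  # a 3x3 pixel speck read as a character
        "l 40 20 46 62 0",
    ]
    for box in filter_boxes(sample, min_height=10):
        print(box)
```

Filtering on height alone also drops real punctuation, so for text that contains periods or commas the thresholds would need care, or the position relative to the baseline would have to be considered as well.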

A solution with a dictionary is not possible, because the text consists only of 
names of persons and locations.

Another thing I wonder about:
when I OCR an image from .tiff to .txt
and run makebox on the same image,
some (few) letters are recognized differently!

Thanks in advance for any help.

Original issue reported on code.google.com by pj...@aon.at on 19 Apr 2015 at 12:54

GoogleCodeExporter commented 8 years ago
Please use the tesseract forum for support questions [1].

[1] https://code.google.com/p/tesseract-ocr/wiki/FAQ#Rules_and_advices

Original comment by zde...@gmail.com on 19 Apr 2015 at 8:40