karldergrosse / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

check in xheight fix #18

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
OCRopus needs the xheight fix for one-line OCR (Ray knows what this means)

Original issue reported on code.google.com by tmb...@gmail.com on 16 Mar 2007 at 12:11

GoogleCodeExporter commented 9 years ago
I wonder if this is what I was running into with my OCR product. Would like some details.

Original comment by ScanH...@gmail.com on 21 Mar 2007 at 6:36

GoogleCodeExporter commented 9 years ago
One of the changes in 1.03 "improved" the x-height calculation, which increased the probability that a text line is regarded as all caps. While this improved overall results on documents that include a lot of small caps or all caps, it makes things worse in some cases on small amounts of text.
I am still working on a compromise solution that will restore the previous behavior on small amounts of normal text without compromising accuracy on small caps or all caps (both of which tend to show up in small amounts). Unfortunately, it is very difficult to tell the difference between an all-caps (or small-caps) word and a genuine all-x-height word.

Original comment by theraysm...@gmail.com on 29 Mar 2007 at 1:37
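The ambiguity described above can be illustrated with a toy sketch. This is not Tesseract's x-height code; the function name `estimate_xheight`, the `split_ratio` threshold, and the pixel heights are all assumptions made for illustration. It only shows that a line whose glyphs share one height gives no evidence about whether that height is the cap height or the x-height.

```python
from statistics import median


def estimate_xheight(heights, split_ratio=1.25):
    """Toy x-height guess from glyph heights (positive pixel values) on one line.

    Returns (estimate, confident). `confident` is True only when the line
    contains both a short and a tall height class, so the short class can
    be read as the x-height. All names and thresholds here are hypothetical.
    """
    if not heights:
        raise ValueError("need at least one glyph height")
    shortest = min(heights)
    # Glyphs close to the shortest one form the "short" class.
    short = [h for h in heights if h < shortest * split_ratio]
    # Anything clearly taller is treated as a capital or an ascender.
    tall = [h for h in heights if h >= shortest * split_ratio]
    if tall:
        return median(short), True
    # Only one height class: it could be x-height ("some") or cap height
    # ("SOME"); the heights alone cannot distinguish the two cases.
    return median(short), False


# Assumed example heights: "Some" has one tall glyph, "some" and "SOME" have none.
print(estimate_xheight([30, 20, 20, 20]))
print(estimate_xheight([20, 20, 20, 20]))
print(estimate_xheight([30, 30, 30, 30]))
```

With more text on the page, neighbouring lines can supply the missing height class, which is consistent with the problem showing up mainly on small amounts of text.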

GoogleCodeExporter commented 9 years ago
Since vowels are so common, is it not possible to run a special check for one of them being capitalized if the block is smaller than some low number of pixels? If even one is capitalized, make the line lower-case. Also, is there any chance to make these types of decisions tunable via a command-line/config-file option? That would give the external application a chance to run it *both* ways if the user specified "try harder", or just the default if one wanted a "speedy" result.

Tess is improving at a deeper level than I even anticipated. Thank you, Ray.

Original comment by fil...@repairfaq.org on 16 Apr 2007 at 1:38
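The "run it both ways" idea could look roughly like the sketch below on the calling application's side. Everything here is hypothetical: `run_ocr` stands for whatever function the application uses to invoke the engine, `legacy_xheight` for a tunable switch of the kind being requested, and `score` for any confidence measure the application trusts. None of these are Tesseract APIs.

```python
def ocr_with_effort(image, run_ocr, score, try_harder=False):
    """Run OCR once by default, or under both x-height settings if asked.

    run_ocr(image, legacy_xheight=bool) -> str   (hypothetical callable)
    score(text) -> number, higher is better      (hypothetical callable)
    """
    if not try_harder:
        # "Speedy" path: a single pass with the default setting.
        return run_ocr(image, legacy_xheight=False)
    # "Try harder" path: run both settings and keep the better-scoring text.
    # max() returns the first maximum, so a tie keeps the default result.
    candidates = [run_ocr(image, legacy_xheight=flag) for flag in (False, True)]
    return max(candidates, key=score)
```

The cost of the second mode is a full extra recognition pass, which is why it is kept behind an explicit flag in this sketch.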

GoogleCodeExporter commented 9 years ago
1.04 has some improvements in this area, but there is still work to do.

Original comment by theraysm...@gmail.com on 17 May 2007 at 7:16

GoogleCodeExporter commented 9 years ago
v2.0 will introduce a BOOL_VAR called textord_ocropus_mode. When set to true, the x-height calculation code will run as in 1.03; when set to false (the default), it will run in a different mode that is better on average but worse for the way ocropus uses it (which defeats fix_xheight()).

Original comment by theraysm...@gmail.com on 13 Jul 2007 at 1:43

GoogleCodeExporter commented 9 years ago

Original comment by theraysm...@gmail.com on 18 Jul 2007 at 10:23