Tesseract should recognize common ligatures for improved recognition rates

patcharats / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Tesseract should recognize common ligatures for improved recognition rates #69

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
Sample a source document composed with a font containing ligatures, such as "ffi" in efficient, "ffl" in affluent, etc.

What is the expected output? What do you see instead?
The "ffi" ligature is often recognized as "fh", for example.

What version of the product are you using? On what operating system?
2.01 on Mac OS 10.4.

Please provide any additional information below.

Original issue reported on code.google.com by mrondine...@gmail.com on 23 Sep 2007 at 10:07

GoogleCodeExporter commented 9 years ago

You can't even train ligatures longer than 2 characters.

Original comment by gtrw...@gmail.com on 15 Mar 2008 at 7:17

GoogleCodeExporter commented 9 years ago

2.03 accepts ligatures up to 24 bytes of UTF-8.
More work is being done in 3.00 to improve accuracy on ligatures.
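
To put the 24-byte figure in perspective, here is a small sketch that measures the UTF-8 length of a candidate character sequence. The helper names and the `MAX_UNICHAR_BYTES` constant are hypothetical, chosen for illustration; they are not taken from the Tesseract source, and the limit itself is only the number quoted above.

```python
# Hypothetical constant: the 24-byte limit quoted for 2.03 in this thread.
MAX_UNICHAR_BYTES = 24


def utf8_len(text):
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_limit(text, limit=MAX_UNICHAR_BYTES):
    """Return True if the UTF-8 encoding of `text` is at most `limit` bytes."""
    return utf8_len(text) <= limit


if __name__ == "__main__":
    # Compare the spelled-out letters with the single-codepoint ligatures.
    for candidate in ["ffi", "ffl", "\ufb03", "\ufb04"]:
        print(repr(candidate), utf8_len(candidate), fits_limit(candidate))
```

Both the three ASCII letters "ffi" and the single codepoint U+FB03 encode to 3 bytes, so either form sits well under such a limit.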

Original comment by theraysm...@gmail.com on 28 Dec 2008 at 7:22

Changed state: Started

GoogleCodeExporter commented 9 years ago

Fixed in 3.00.

Original comment by theraysm...@gmail.com on 20 May 2010 at 7:00

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

At least with the downloadable Danish training data, it now overzealously interprets "fi" as the ligature "ﬁ", even though that ligature is practically never encountered in Danish texts (while "fi" is very common, and kerning often causes the top right of the "f" to overlap the dot of the "i").

It can be argued that this isn't a bug; after all, there is no way to tell from the individual shape that it isn't a ligature, but the language makes it very unlikely. It's easy to work around by replacing "ﬁ" with "fi". Still, it would be nice if "ﬁ" in Danish were only used when the full word would then match a word in the dictionary.

Original comment by joakim.a...@gmail.com on 18 Jan 2011 at 11:32

GoogleCodeExporter commented 9 years ago

You can blacklist them:
https://groups.google.com/forum/#!topic/tesseract-ocr/jO_4ZMMK9xw

However, http://stb-tester.com/blog/2014/04/14/improving-ocr-accuracy says it's better to detect them as ligatures and then post-process to expand them to normal characters afterward. It seems you have to do that yourself, though. Here's their Python file that does it:
https://github.com/stb-tester/stb-tester/blob/91f7f22309ada6fc2f1b35f4321c49dc7b32fb5c/stbt.py#L677
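
For anyone who just wants the idea, a minimal sketch of that kind of post-processing follows. It is not the stb-tester code; the `LIGATURES` table and the `expand_ligatures` name are assumptions made for illustration, and the table only covers some of the Latin ligatures in Unicode's Alphabetic Presentation Forms block.

```python
# Assumed mapping from single-codepoint Latin ligatures to plain letters.
# Extend it if your OCR output contains other presentation forms.
LIGATURES = {
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
    "\ufb06": "st",   # LATIN SMALL LIGATURE ST
}

# str.translate needs a table keyed by codepoint; str.maketrans builds it.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace each known ligature codepoint in OCR output with its letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    # Run the expansion over text as the OCR engine might return it.
    print(expand_ligatures("e\ufb03cient, a\ufb04uent, \ufb01n"))
```

An explicit table is used here instead of Unicode NFKC normalization because NFKC also rewrites many unrelated characters, which may not be wanted in OCR output.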

Original comment by omegat...@gmail.com on 11 Apr 2015 at 12:44