Closed GoogleCodeExporter closed 9 years ago
You can't even train ligatures longer than 2 characters.
Original comment by gtrw...@gmail.com
on 15 Mar 2008 at 7:17
2.03 accepts ligatures up to 24 bytes of utf-8.
More work is being done in 3.00 to improve accuracy on ligatures.
Original comment by theraysm...@gmail.com
on 28 Dec 2008 at 7:22
Fixed in 3.00.
Original comment by theraysm...@gmail.com
on 20 May 2010 at 7:00
For the Danish training data available for download, at least, it now overzealously
interprets "fi" as the ligature "ﬁ" (U+FB01), even though that ligature is practically
never encountered in Danish texts (while plain "fi" is very common, and kerning often
causes the top right of the "f" to overlap the dot of the "i").
It can be argued that this isn't a bug; after all, there is no way to tell from
the individual shape that it isn't a ligature - but the language makes it very
unlikely. It's easy to work around by replacing "ﬁ" with "fi". Still, it
would be nice if "ﬁ" in Danish were only used when the full word would then
match a word in the dictionary.
Original comment by joakim.a...@gmail.com
on 18 Jan 2011 at 11:32
You can blacklist them
https://groups.google.com/forum/#!topic/tesseract-ocr/jO_4ZMMK9xw
but http://stb-tester.com/blog/2014/04/14/improving-ocr-accuracy says it's
better to detect them as ligatures and then post-process to expand them to
normal characters afterward. It seems you have to do that yourself, though.
Here's their python file that does it:
https://github.com/stb-tester/stb-tester/blob/91f7f22309ada6fc2f1b35f4321c49dc7b32fb5c/stbt.py#L677
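For anyone who just wants the post-processing step, here is a minimal sketch of the idea. It is not the stb-tester code and not anything Tesseract ships; the names `LIGATURES` and `expand_ligatures` are made up for illustration, and the table only covers the common Latin ligatures in the Unicode Alphabetic Presentation Forms block.

```python
# Hypothetical helper: expand ligature code points in OCR output
# back to their plain-letter sequences.
LIGATURES = {
    "\ufb00": "ff",   # LATIN SMALL LIGATURE FF
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",   # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",  # LATIN SMALL LIGATURE FFL
    "\ufb06": "st",   # LATIN SMALL LIGATURE ST
}

# str.translate needs a table keyed by code point; maketrans builds it.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Return text with each known ligature replaced by plain letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    ocr_output = "\ufb01sk og \ufb02\u00f8de"  # stand-in for OCR output
    print(expand_ligatures(ocr_output))
```

Run it over the text Tesseract returns, before any dictionary lookup or string matching. An explicit table is used rather than `unicodedata.normalize("NFKC", ...)` because NFKC would also rewrite other compatibility characters, which may not be wanted.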
Original comment by omegat...@gmail.com
on 11 Apr 2015 at 12:44
Original issue reported on code.google.com by
mrondine...@gmail.com
on 23 Sep 2007 at 10:07