[This is a feature request]
I would love to use Tesseract on agglutinative languages [
https://en.wikipedia.org/wiki/Agglutinative_language ], but the nature of these
languages makes it almost impossible to come up with a static language file: a
file that claims to cover all the words in one of these languages can run into
several hundred billion lines, and even then there's no guarantee that it's
complete. This means that, for these languages, a little more intelligence (in
the form of morphological analysis) is needed before sanctioning a word.
And that's where the problem arises: Tesseract requires pre-prepared language
files, and this, as explained above, is just not good enough.
Yet the solution is simple: make word sanctioning a callback, and let the
calling application decide whether or not a word is correct and, if it's
not, supply the correct(ed) one back to Tesseract.
Actually, this kind of mechanism would also benefit people who want to use
their own spell-checkers (hunspell, aspell, myspell etc.), or who want to
handle context-sensitive word selection more accurately.
For someone familiar with Tesseract's codebase and proficient enough in C++
(unfortunately, I am neither), doing this shouldn't be too hard, or so
I am hoping :)
BTW, naturally, having this kind of mechanism does not mean that I am
suggesting Tesseract should do away with its current mode of operation; all I
am saying is this: if the user/caller has set the callback, Tesseract should
use it. If not, then business as usual.
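To make the request more concrete, here is a rough sketch of how such a hook might look. None of this is existing Tesseract API: `WordValidator`, `WordSanctioner`, `SetValidator`, `Sanction` and `ToyMorphologicalValidator` are all hypothetical names made up for illustration, the "dictionary" is just a set of strings standing in for a language file, and the toy validator only knows one Turkish stem. It assumes C++17 (for `std::optional`).

```cpp
#include <algorithm>
#include <functional>
#include <optional>
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical callback type: returns the accepted (possibly corrected) word,
// or std::nullopt if the caller rejects the candidate.
using WordValidator =
    std::function<std::optional<std::string>(const std::string&)>;

// Hypothetical stand-in for the engine's word-sanctioning step.
class WordSanctioner {
 public:
  explicit WordSanctioner(std::unordered_set<std::string> dictionary)
      : dictionary_(std::move(dictionary)) {}

  // Optional: if never called, the static dictionary is used as before.
  void SetValidator(WordValidator validator) {
    validator_ = std::move(validator);
  }

  std::optional<std::string> Sanction(const std::string& candidate) const {
    if (validator_) {
      return validator_(candidate);  // the caller decides
    }
    // Business as usual: plain lookup in the static word list.
    if (dictionary_.count(candidate) > 0) {
      return candidate;
    }
    return std::nullopt;
  }

 private:
  std::unordered_set<std::string> dictionary_;
  WordValidator validator_;  // empty unless the caller sets it
};

// Toy "morphological analysis": the stem "ev" (house) followed by one of a
// few known suffix chains. A real analyzer would be far more general.
std::optional<std::string> ToyMorphologicalValidator(const std::string& word) {
  static const std::string stem = "ev";
  static const std::unordered_set<std::string> suffix_chains = {
      "", "ler", "lerim", "lerimde"};

  // Toy correction: treat the digit '1' as a misread letter 'l'.
  std::string fixed = word;
  std::replace(fixed.begin(), fixed.end(), '1', 'l');

  if (fixed.compare(0, stem.size(), stem) != 0) {
    return std::nullopt;  // does not start with the known stem
  }
  if (suffix_chains.count(fixed.substr(stem.size())) > 0) {
    return fixed;  // accepted, possibly in corrected form
  }
  return std::nullopt;
}
```

With only the dictionary `{"ev"}`, `Sanction("evlerimde")` is rejected; after `SetValidator(ToyMorphologicalValidator)`, the same call is accepted, and a misread `"ev1er"` comes back as `"evler"`. The same hook could wrap hunspell or any other checker the caller prefers.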
Original issue reported on code.google.com by adem.meda@gmail.com on 17 Jul 2012 at 1:14