itwood / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

all language related operations as callbacks #734

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
[This is a feature request]

I would love to use Tesseract on agglutinative languages [
https://en.wikipedia.org/wiki/Agglutinative_language ], but the nature of these
languages makes it almost impossible to come up with a static language file: a
file that claims to cover all the words in one of these languages can run into
several hundred billion lines, and even then there's no guarantee that it's
complete. This means that, for these languages, a little more intelligence (in
the form of morphological analysis) is needed before sanctioning a word.

And that's where the problem arises: Tesseract requires pre-prepared language
files, and, as explained above, that is just not good enough.

Yet the solution is simple: make word sanctioning a callback, and let the
calling application decide whether or not a word is correct and, if it's not,
supply the corrected one back to Tesseract.
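
A rough sketch of what such a callback could look like, under stated assumptions: none of these names (`Verdict`, `WordSanctioner`, `matchesSuffixChain`, `makeMorphologySanctioner`) exist in Tesseract, and the "stem plus chain of suffixes" rule is only a toy stand-in for real morphological analysis. It is meant to show the shape of the interface, not an actual implementation.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical result of sanctioning one candidate word.
struct Verdict {
    bool accepted;     // true if the word should be kept
    std::string word;  // the word to keep (could be a corrected form)
};

// Hypothetical callback type: the engine would call this for each candidate.
using WordSanctioner = std::function<Verdict(const std::string&)>;

// True if `rest` is empty or can be split into a chain of known suffixes.
bool matchesSuffixChain(const std::string& rest,
                        const std::vector<std::string>& suffixes) {
    if (rest.empty()) return true;
    for (const std::string& s : suffixes) {
        if (s.empty()) continue;  // an empty suffix would recurse forever
        if (rest.compare(0, s.size(), s) == 0 &&
            matchesSuffixChain(rest.substr(s.size()), suffixes)) {
            return true;
        }
    }
    return false;
}

// Builds a callback that accepts "known stem + zero or more known suffixes".
// This toy version never corrects a word; it only accepts or rejects it.
WordSanctioner makeMorphologySanctioner(std::vector<std::string> stems,
                                        std::vector<std::string> suffixes) {
    return [stems, suffixes](const std::string& candidate) -> Verdict {
        for (const std::string& stem : stems) {
            if (candidate.compare(0, stem.size(), stem) == 0 &&
                matchesSuffixChain(candidate.substr(stem.size()), suffixes)) {
                return Verdict{true, candidate};
            }
        }
        return Verdict{false, candidate};
    };
}
```

With the stem "ev" and the suffixes "ler", "im" and "de", this accepts "evlerimde" (ev + ler + im + de) and rejects "evlerimdc", without any of the inflected forms being listed in a file.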

Actually, a mechanism like this would also benefit people who want to use
their own spell-checkers (hunspell, aspell, myspell, etc.), or who want to
handle context-sensitive word selection more accurately.

For someone familiar with Tesseract's codebase and proficient in C++
(unfortunately, I am neither), doing this shouldn't be too hard, or so I am
hoping :)

BTW, naturally, having this kind of mechanism does not mean that I am
suggesting Tesseract should do away with its current mode of operation; all I
am saying is this: if the user/caller has set the callbacks, Tesseract should
use them. If not, then business as usual.
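
The "callback if set, otherwise business as usual" rule could be sketched roughly as follows. `WordChecker`, `setCallback` and `accepts` are hypothetical names for illustration and are not taken from Tesseract; the static word set merely stands in for a pre-prepared language file.

```cpp
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical stand-in for the engine's word check.
class WordChecker {
public:
    using Callback = std::function<bool(const std::string&)>;

    explicit WordChecker(std::unordered_set<std::string> staticWords)
        : staticWords_(std::move(staticWords)) {}

    // Optional: the caller installs its own check (spell-checker,
    // morphological analyzer, ...).
    void setCallback(Callback cb) { callback_ = std::move(cb); }

    bool accepts(const std::string& word) const {
        if (callback_) {
            return callback_(word);            // the caller decides
        }
        return staticWords_.count(word) != 0;  // business as usual
    }

private:
    std::unordered_set<std::string> staticWords_;
    Callback callback_;  // empty unless setCallback() was called
};
```

An empty `std::function` converts to `false`, so existing users who never call `setCallback` would keep the static lookup unchanged.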

Original issue reported on code.google.com by adem.meda@gmail.com on 17 Jul 2012 at 1:14

GoogleCodeExporter commented 9 years ago

Original comment by zde...@gmail.com on 21 Jul 2012 at 4:45