Wow, it is so easy! Why haven't you programmed it yet?
This is an open source program, so everybody can contribute. I am waiting for
your training program desperately.
Original comment by zde...@gmail.com
on 13 Feb 2012 at 11:17
Excellent idea. Thanks to Zde, project member, for his support. Yes, I am also
waiting for your wonderful training program, which would be a boon for users. Kindly
start building the program without further delay. Wishing you the best of luck
in your good mission.
Original comment by withbles...@gmail.com
on 13 Feb 2012 at 4:30
@mlissner: there were such attempts -
http://code.google.com/p/tesseractindic/source/browse/#svn/trunk/tesseract_trainer
- and I tried to improve on it, see
https://github.com/zdenop/tesseract-auto-training, but I came back to training
from real scans...
IMHO the most difficult step (in the current process as described on the wiki) is to
create GOOD input images (with box files). The other steps can be done with a simple
script (see https://github.com/paalberti/tesseract-dan-fraktur and in particular
https://github.com/paalberti/tesseract-dan-fraktur/blob/master/swe-frak/buildscript.sh)
Original comment by zde...@gmail.com
on 13 Feb 2012 at 8:55
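(For illustration, a minimal sketch of the kind of script the comment above describes,
following the legacy Tesseract 3.x training steps from the wiki. The language code "xyz",
the font name "myfont" and the file names are placeholders, and the tif/box pair is
assumed to exist already.)

    # train_sketch.py - rough sketch of the legacy Tesseract 3.x training steps
    # (the same steps a buildscript automates). "xyz" and "myfont" are
    # placeholders; xyz.myfont.exp0.tif and xyz.myfont.exp0.box must exist.
    import shutil
    import subprocess

    LANG, FONT = "xyz", "myfont"
    base = f"{LANG}.{FONT}.exp0"

    def run(*cmd):
        print(" ".join(cmd))
        subprocess.run(cmd, check=True)

    run("tesseract", f"{base}.tif", base, "box.train")   # -> xyz.myfont.exp0.tr
    run("unicharset_extractor", f"{base}.box")           # -> unicharset

    # font_properties: <font> <italic> <bold> <fixed> <serif> <fraktur>
    with open("font_properties", "w") as f:
        f.write(f"{FONT} 0 0 0 0 0\n")

    run("mftraining", "-F", "font_properties", "-U", "unicharset",
        "-O", f"{LANG}.unicharset", f"{base}.tr")        # -> inttemp, pffmtable, shapetable
    run("cntraining", f"{base}.tr")                      # -> normproto

    for name in ("inttemp", "pffmtable", "shapetable", "normproto"):
        shutil.copy(name, f"{LANG}.{name}")              # combine_tessdata wants the prefix

    run("combine_tessdata", f"{LANG}.")                  # -> xyz.traineddata

(In practice you would loop over several tif/box pairs and pass all the .tr files to
mftraining/cntraining, but the overall flow is the same.)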
As a new user approaching Tesseract, I think those scripts look great. Unfortunately, I
didn't find them when I needed them, and they're probably out of date anyway
since they're not built into Tesseract. Is there a reason we can't provide
something like these to make the process easier?
The only step I see that needs human work is adjusting the box files. The rest
seems like it should be done by a computer, and even adjusting the box files
can be made pretty easy if we have a script that can merge the locations from
the box file (roughly) with the letters from an input file.
Also - if the challenge is to create GOOD input images, that seems like another
reason to build this into Tesseract itself, so that such images can be created
by a computer, not by scanning/manipulating in iterative and hacky ways.
Original comment by mliss...@michaeljaylissner.com
on 13 Feb 2012 at 10:06
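(As an aside on the box-file idea above: a minimal sketch of such a merge script,
assuming the coordinates come from "tesseract page.tif page batch.nochop makebox" and
the ground truth is a plain text file. The file names are placeholders, and it only
handles the simple case where the box count matches the character count; split or
merged boxes still need manual fixing.)

    # merge_boxes.py - sketch of the idea above: keep the coordinates from the
    # generated .box file but overwrite the (often wrong) characters with the
    # known ground-truth text. File names below are placeholders.
    import sys

    def merge(box_path, text_path, out_path):
        # ground truth with whitespace removed: one character per expected box
        chars = [c for c in open(text_path, encoding="utf-8").read() if not c.isspace()]
        boxes = [line.split() for line in open(box_path, encoding="utf-8") if line.strip()]
        if len(chars) != len(boxes):
            sys.exit(f"mismatch: {len(boxes)} boxes vs {len(chars)} characters - "
                     "fix split/merged boxes in a box editor first")
        with open(out_path, "w", encoding="utf-8") as out:
            for ch, (_, left, bottom, right, top, page) in zip(chars, boxes):
                out.write(f"{ch} {left} {bottom} {right} {top} {page}\n")

    if __name__ == "__main__":
        merge("page.box", "page.gt.txt", "page.corrected.box")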
Lots of times, people don't have the desired fonts; all they have are some
scanned images of old documents they want to digitize. If you do have the fonts,
you can use jTessBoxEditor to generate TIFF/Box files suitable for training
with Tesseract. Once you get a good set of them, you can use train.ps1 to
automate generation of language data files.
Original comment by nguyen...@gmail.com
on 15 Feb 2012 at 1:31
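(A rough sketch of what "generate TIFF/Box files from a font" means in practice,
assuming Pillow is available; the font path, text and output names are placeholders,
and a real tool like jTessBoxEditor handles line wrapping, kerning and many fonts at
once.)

    # box_from_font.py - hypothetical sketch: render one line of text with Pillow
    # and emit a matching Tesseract .box file. Font path, size, text and output
    # names are placeholders.
    from PIL import Image, ImageDraw, ImageFont

    TEXT = "Hello Tesseract"
    FONT = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 48)
    W, H = 1200, 120

    img = Image.new("L", (W, H), 255)           # white page
    draw = ImageDraw.Draw(img)

    x, y = 20, 30                               # pen position (top-left of the line)
    box_lines = []
    for ch in TEXT:
        if ch != " ":
            draw.text((x, y), ch, font=FONT, fill=0)
            l, t, r, b = FONT.getbbox(ch)       # glyph bounds relative to the pen
            # .box coordinates count y from the BOTTOM of the image
            box_lines.append(f"{ch} {x + l} {H - (y + b)} {x + r} {H - (y + t)} 0")
        x += int(FONT.getlength(ch))            # advance the pen

    img.save("eng.dejavu.exp0.tif")
    with open("eng.dejavu.exp0.box", "w", encoding="utf-8") as f:
        f.write("\n".join(box_lines) + "\n")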
The reason for scanning images rather than generating them from fonts is that the
classifier needs to be robust to the sorts of distortions that happen when 1) the
text is printed and 2) that text is scanned.
Generating images with such distortions is an extremely under-researched area -
I was only able to find one (1) paper on the topic. OCRopus had an
implementation of the techniques presented in that paper, once upon a time, but
I think it has been rewritten twice since then, and that component was not
included in either rewrite.
Google have some software to do this, but it's written to target their internal
facilities and needs to be rewritten to work anywhere else. They are going to
release it, but there's no definite timeline (I was told about this in 2010). I
presume it includes a distortion component, but never asked when I had the
chance. In any case, (IIRC) it was used to generate the language data in the
3.x series.
ImageJ has a component to generate a distortion model from a pair of images, which
might be useful in the meantime.
(Oh, and if you think the tesseract training documentation is scary, don't ever
look at the opencv documentation :)
Original comment by joregan
on 23 Feb 2012 at 11:47
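(As an aside to the degradation discussion above, a crude sketch of the kind of
skew/blur/noise/threshold jitter one might apply to a clean rendered page before
training. It assumes Pillow and NumPy, the parameters are arbitrary, and it is nowhere
near a proper print/scan degradation model.)

    # degrade.py - crude illustration of simulated print/scan degradation:
    # slight skew, blur, sensor noise and a jittery binarization threshold.
    # Parameters and file names are arbitrary placeholders.
    import numpy as np
    from PIL import Image, ImageFilter

    def degrade(path, out_path, seed=0):
        rng = np.random.default_rng(seed)
        img = Image.open(path).convert("L")
        img = img.rotate(rng.uniform(-1.0, 1.0), fillcolor=255)   # slight skew
        img = img.filter(ImageFilter.GaussianBlur(radius=1.0))    # ink spread / defocus
        a = np.asarray(img, dtype=np.float32)
        a += rng.normal(0, 12, a.shape)                           # sensor noise
        a = np.where(a > 160 + rng.normal(0, 8), 255, 0)          # noisy threshold
        Image.fromarray(a.astype(np.uint8)).save(out_path)

    degrade("clean_page.tif", "degraded_page.tif")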
Balthazar Rouberol created such a tool as requested by the reporter - see
https://github.com/BaltoRouberol/TesseractTrainer
so I am closing this issue...
Original comment by zde...@gmail.com
on 30 Jul 2012 at 11:59
Original issue reported on code.google.com by
mliss...@michaeljaylissner.com
on 13 Feb 2012 at 7:59