Open tokee opened 7 years ago
@tokee
buildscript.sh is set up for training Tesseract 3.0x. For 4.0 training you have to use the tesstrain.sh script at https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh
The LSTM training process has so far only been described for synthetic images created by the text2image program, not for pre-existing box/tiff pairs.
For reference, you can take a look at frk.traineddata and langdata.
If you want to use these box/tiff pairs, you will need to modify the box files, adding a new box with a tab character at the end of each text line.
See attached file as a sample. frk.embedsiver.exp0.box.txt
You can use a box editor, such as jTessBoxEditor, to do so.
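If you would rather script the box-file change than do it by hand, here is a rough Python sketch of the idea. It is not an official tool, and several things in it are my own assumptions: each box line is taken to be `glyph left bottom right top page`, a text line is assumed to end where the next box starts further left than the current one (or the page number changes), and the tab box is given a one-pixel-wide placeholder position just right of the last glyph. The function names are made up for illustration. Please compare the output against the attached sample before training with it.

```python
from typing import List, Tuple

# glyph, left, bottom, right, top, page
Box = Tuple[str, int, int, int, int, int]


def parse_box_line(line: str) -> Box:
    """Split one box-file line into the glyph and its five integer fields."""
    glyph, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return glyph, int(left), int(bottom), int(right), int(top), int(page)


def add_line_end_boxes(lines: List[str]) -> List[str]:
    """Return the box lines with a tab box appended after each text line.

    Assumption (heuristic, left-to-right text only): a text line ends when
    the next box starts to the left of the current one, when the page
    changes, or when there are no more boxes.
    """
    boxes = [parse_box_line(l) for l in lines if l.strip()]
    out: List[str] = []
    for i, (glyph, left, bottom, right, top, page) in enumerate(boxes):
        out.append(f"{glyph} {left} {bottom} {right} {top} {page}")
        nxt = boxes[i + 1] if i + 1 < len(boxes) else None
        ends_line = nxt is None or nxt[5] != page or nxt[1] < left
        if ends_line:
            # Assumed placement: a 1-pixel-wide box right after the last glyph.
            out.append(f"\t {right} {bottom} {right + 1} {top} {page}")
    return out


if __name__ == "__main__":
    sample = ["D 10 100 30 130 0", "e 32 100 45 120 0", "r 12 60 25 80 0"]
    for row in add_line_end_boxes(sample):
        print(repr(row))
```

The heuristic will misfire on pages with multiple columns or heavily skewed scans, so for those a box editor is probably still the safer route.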
Just a heads-up: Thank you for your help. Work dictates that I spend the next week on other things, but I'll get back to tesseract after that.
Hi,
Any update?
How can I use the files with the latest tesseract version?
Sorry, my priorities were shifted. OCR is now "sometime later this year". No guarantee they won't be shifted again.
AviFix, just in case you are not aware of it, I would like to note that you can use the traineddata files generated with tesseract 3 just fine with tesseract 4, so you are not left completely in the dark.
This issue would be solved by running the training process with the latest OCR engine (LSTM), which means starting over with a new set of files and a different approach. I have also been intending to look at this, but I'm not actively working on it and can't offer any timeline.
Regarding "you can use the traineddata files generated with tesseract 3 just fine with tesseract 4": use --oem 0 for that.
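For anyone scripting this, a minimal Python sketch of how the command line could be put together. The file names `page.tif` and `page` and the helper name `legacy_ocr_command` are placeholders of mine, and it assumes `dan_frak.traineddata` is in a tessdata directory that tesseract can find.

```python
from typing import List


def legacy_ocr_command(image: str, output_base: str, lang: str) -> List[str]:
    """Build a tesseract 4 command that uses the legacy engine (--oem 0).

    The legacy engine is the one that can read traineddata files
    produced by tesseract 3 training.
    """
    return ["tesseract", image, output_base, "-l", lang, "--oem", "0"]


if __name__ == "__main__":
    # Placeholder file names; pass the list to subprocess.run() to execute it.
    print(" ".join(legacy_ocr_command("page.tif", "page", "dan_frak")))
```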
I would very much like to try out the new tesseract 4 alpha LSTM with fraktur, but cannot find any trained fraktur models anywhere. So I tried running buildscript.sh in dan_frak, but got a lot of errors and a 691 byte dan_frak.traineddata. Same story with deu_frak.