Open InbarShapira opened 3 years ago
For the LSTM model, use --langdata_dir langdata_lstm
You can limit the number of pages if you are only fine-tuning.
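As a rough illustration of the advice above, here is a minimal Python sketch that only assembles a tesstrain.sh command line (it does not run it). The helper name `build_tesstrain_args` and the directory names `tessdata` and `engtrain` are hypothetical placeholders. The flags follow the old tesstrain.sh usage as I remember it, so please check `--maxpages` in particular against your copy of the script.

```python
def build_tesstrain_args(lang, langdata_dir, tessdata_dir, output_dir, max_pages=None):
    """Assemble an argument list for tesstrain.sh (hypothetical helper, builds only)."""
    args = [
        "tesstrain.sh",
        "--lang", lang,
        "--linedata_only",              # generate line data for LSTM training only
        "--noextract_font_properties",
        "--langdata_dir", langdata_dir,  # langdata_lstm for the LSTM model
        "--tessdata_dir", tessdata_dir,
        "--output_dir", output_dir,
    ]
    if max_pages is not None:
        # Assumed flag for limiting pages, useful when only fine-tuning.
        args += ["--maxpages", str(max_pages)]
    return args


# Placeholder paths; adjust to your own checkout.
cmd = build_tesstrain_args("eng", "langdata_lstm", "tessdata", "engtrain", max_pages=10)
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` once the paths point at real directories.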
So if I want to train an LSTM model from scratch that reaches the accuracy of the official Tesseract LSTM model, what training data do I need to create, and how?
Thanks @Shreeshrii. I went over this documentation, and something is still not clear to me.
When I follow the instructions, the script creates a TIFF file with ~50 lines per page and ~3700 pages in total, which comes to about 185,000 lines of text for just a single font. The instructions say to use ~4000 fonts for English, so the total number of lines created would be 4000*185,000 = 740,000,000, whereas according to this post (https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951) the training set comprises only 400,000-800,000 text lines.
What am I missing?
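To make the mismatch concrete, the arithmetic from the question can be written out as below. All numbers are the approximate figures quoted in this thread, not measured values.

```python
# Approximate figures quoted in the question above.
lines_per_page = 50
pages_per_font = 3700
fonts = 4000

lines_per_font = lines_per_page * pages_per_font   # 185,000
total_lines = lines_per_font * fonts                # 740,000,000

# Training set size range mentioned in the linked post.
reported_low, reported_high = 400_000, 800_000

print(f"lines per font: {lines_per_font:,}")
print(f"total lines:    {total_lines:,}")
print(f"excess factor:  {total_lines // reported_high}x to {total_lines // reported_low}x")
```

So following the instructions literally would produce roughly three orders of magnitude more text lines than the reported training set size.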
Our knowledge about the training method is based on Ray Smith's posts and comments. It is possible that he experimented with different settings and the posts at different times reflect that.
https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files-in-tessdata_fast.md shows the following info for English traineddata.
Version string:4.00.00alpha:eng:synth20170629 LSTM training info:Network str:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1], flags=41, iteration=6352400, sample_iteration=6352704, null_char=110, learning_rate=0.001, momentum=0.5, adam_beta=0.999
While for tessdata_best it is:
eng Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1] LSTM training info:Network str:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1], flags=40, iteration=814100, sample_iteration=814136, null_char=110, learning_rate=0.001, momentum=0.5, adam_beta=0.999
Look at the number of iterations to see the difference.
I haven't seen any post where someone has been able to replicate his results.
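For anyone who wants to compare the two models quoted above programmatically, here is a small sketch that parses the `key=value` part of each "LSTM training info" string. The strings are copied from this comment, and the helper name `parse_training_info` is made up for illustration.

```python
import re

# Trailing key=value parts of the two strings quoted above, in that order.
fast_info = ("flags=41, iteration=6352400, sample_iteration=6352704, null_char=110, "
             "learning_rate=0.001, momentum=0.5, adam_beta=0.999")
best_info = ("flags=40, iteration=814100, sample_iteration=814136, null_char=110, "
             "learning_rate=0.001, momentum=0.5, adam_beta=0.999")


def parse_training_info(text):
    """Turn 'key=value, key=value' text into a dict of ints and floats."""
    pairs = re.findall(r"(\w+)=([\d.]+)", text)
    return {key: float(value) if "." in value else int(value) for key, value in pairs}


fast = parse_training_info(fast_info)
best = parse_training_info(best_info)
ratio = fast["iteration"] / best["iteration"]
print(f"iterations: {fast['iteration']:,} vs {best['iteration']:,} (ratio {ratio:.1f})")
```

The other hyperparameters (learning rate, momentum, adam_beta) are identical in both strings. Only the flags, the network spec and the iteration counts differ.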
Hello.
In your instructions at https://github.com/tesseract-ocr/tessdoc/blob/main/tess4/TrainingTesseract-4.00.md#using-tesstrainsh
you mention the file tesstrain.sh at https://github.com/tesseract-ocr/tesseract/blob/main/src/training/tesstrain.sh
but there is no such file in the tesseract repository.
You also write that
"Training with tesstrain.sh (a.k.a. tesseract 4 training) is unsupported/abandoned. Please use scripts from https://github.com/tesseract-ocr/tesstrain for training."
(see https://github.com/tesseract-ocr/tessdoc/commit/d57b942ebf054a3d34a11293e1465eb2379ae6ff). Could you please update the instructions?
It is not clear to me: when creating training data for the LSTM model using tesstrain.sh, should I use --langdata_dir langdata_lstm or --langdata_dir langdata?
This affects which eng.training_text file will be used to generate the training data.
Which one should I use?