tesseract-ocr / tessdoc

Tesseract documentation
https://tesseract-ocr.github.io/tessdoc/

Creating training data using tesstrain.sh #39

Open InbarShapira opened 3 years ago

InbarShapira commented 3 years ago

When creating training data for the LSTM model with tesstrain.sh, it is not clear whether I should use --langdata_dir langdata_lstm or --langdata_dir langdata.

The choice affects which eng.training_text file is used to generate the training data.

Which should I use?

Shreeshrii commented 3 years ago

For the LSTM model, use --langdata_dir langdata_lstm

If you are only fine-tuning, you can limit the number of pages.
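As a rough illustration of the advice above, here is a minimal Python sketch that only assembles and prints a tesstrain.sh command line; it does not run anything. The directory paths and the page limit of 10 are placeholders I chose for the example, and I am assuming `--maxpages` is the option that limits the number of pages, so check the script's usage text for your Tesseract version before relying on it.

```python
import shlex

# Placeholder paths and values: adjust to your own checkout and fonts.
args = [
    "src/training/tesstrain.sh",
    "--fonts_dir", "/usr/share/fonts",
    "--lang", "eng",
    "--linedata_only",
    "--noextract_font_properties",
    # For the LSTM model, point at langdata_lstm rather than langdata.
    "--langdata_dir", "../langdata_lstm",
    "--tessdata_dir", "./tessdata",
    # Assumed option for limiting pages when fine-tuning.
    "--maxpages", "10",
    "--output_dir", "./engtrain",
]

# Build a shell-safe string so the command can be reviewed before running it.
command = shlex.join(args)
print(command)
```

Printing the command first makes it easy to confirm that langdata_lstm, not langdata, is being passed.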

InbarShapira commented 3 years ago

So if I want to train an LSTM model from scratch that reaches the accuracy of the existing Tesseract LSTM model, what training data do I need to create, and how?

Shreeshrii commented 3 years ago

See https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html#training-text-requirements

udibarzi commented 3 years ago

Thanks @Shreeshrii. I went over this documentation, and something is still not clear to me.

When I follow the instructions, the script creates a TIFF file with ~50 lines per page and ~3,700 pages, which comes to about 185,000 lines of text for a single font. The instructions say to use ~4,000 fonts for English, so the total number of lines created would be 4,000 × 185,000 = 740,000,000, whereas according to this post (https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951) the training set comprises only 400,000-800,000 text lines.
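To make the size of the mismatch concrete, here is a small Python sketch of the arithmetic, using only the approximate figures quoted in this comment. The last step, the number of pages per font that would fit the reported range if every font were used, is purely illustrative and is not a claim about how the original training set was built.

```python
# Approximate figures quoted in the comment above.
lines_per_page = 50
pages_per_font = 3700
fonts = 4000
reported_low, reported_high = 400_000, 800_000

lines_per_font = lines_per_page * pages_per_font   # 185,000
total_lines = lines_per_font * fonts               # 740,000,000

# How far the computed total exceeds the reported range.
excess_vs_high = total_lines / reported_high       # 925x
excess_vs_low = total_lines / reported_low         # 1850x

# Pages per font that would match the reported range with all fonts in use.
lines_per_page_all_fonts = lines_per_page * fonts  # 200,000 lines per page index
pages_low = reported_low / lines_per_page_all_fonts    # 2 pages per font
pages_high = reported_high / lines_per_page_all_fonts  # 4 pages per font

print(f"{total_lines:,} lines, {excess_vs_high:.0f}x to {excess_vs_low:.0f}x the reported size")
print(f"{pages_low:.0f} to {pages_high:.0f} pages per font would match the reported range")
```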

What am I missing?

Shreeshrii commented 3 years ago

Our knowledge about the training method is based on Ray Smith's posts and comments. It is possible that he experimented with different settings and the posts at different times reflect that.

https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files-in-tessdata_fast.md shows the following info for English traineddata.

Version string:4.00.00alpha:eng:synth20170629 LSTM training info:Network str:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1], flags=41, iteration=6352400, sample_iteration=6352704, null_char=110, learning_rate=0.001, momentum=0.5, adam_beta=0.999

While for tessdata_best it is

eng Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1] LSTM training info:Network str:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1], flags=40, iteration=814100, sample_iteration=814136, null_char=110, learning_rate=0.001, momentum=0.5, adam_beta=0.999

Compare the iteration counts (6,352,400 for tessdata_fast versus 814,100 for tessdata_best) to see the difference.
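For anyone who wants to compare such strings programmatically, here is a minimal Python sketch that pulls the network spec and the numeric fields out of the two "LSTM training info" strings quoted above. The helper name `parse_training_info` is my own, and the regular expressions assume the `key=value` layout seen in these two strings; other traineddata files may be formatted differently.

```python
import re

# The two "LSTM training info" strings quoted above.
fast_info = ("Network str:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1], flags=41, "
             "iteration=6352400, sample_iteration=6352704, null_char=110, "
             "learning_rate=0.001, momentum=0.5, adam_beta=0.999")
best_info = ("Network str:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1], flags=40, "
             "iteration=814100, sample_iteration=814136, null_char=110, "
             "learning_rate=0.001, momentum=0.5, adam_beta=0.999")

def parse_training_info(info):
    """Hypothetical helper: return the network spec and numeric fields as a dict."""
    fields = {}
    network = re.search(r"Network str:(\[[^\]]*\])", info)
    if network:
        fields["network"] = network.group(1)
    # Every remaining field has the form key=number.
    for key, value in re.findall(r"(\w+)=([\d.]+)", info):
        fields[key] = float(value) if "." in value else int(value)
    return fields

fast = parse_training_info(fast_info)
best = parse_training_info(best_info)

for name, parsed in (("tessdata_fast", fast), ("tessdata_best", best)):
    print(name, parsed["network"], "iterations:", parsed["iteration"])
print("iteration ratio fast/best:", round(fast["iteration"] / best["iteration"], 1))
```

Besides the iteration counts, the parsed network specs also differ in layer sizes (Lfys48 and Lfx192 for tessdata_fast versus Lfys64 and Lfx512 for tessdata_best).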

I haven't seen any post where someone has been able to replicate his results.

kseniazhagorina commented 3 years ago

Hello. Your instructions at https://github.com/tesseract-ocr/tessdoc/blob/main/tess4/TrainingTesseract-4.00.md#using-tesstrainsh mention the file tesstrain.sh at https://github.com/tesseract-ocr/tesseract/blob/main/src/training/tesstrain.sh, but there is no such file in the tesseract repository. You also write that training with tesstrain.sh (a.k.a. Tesseract 4 training) is unsupported/abandoned and that the scripts from https://github.com/tesseract-ocr/tesstrain should be used for training instead.

https://github.com/tesseract-ocr/tessdoc/commit/d57b942ebf054a3d34a11293e1465eb2379ae6ff Could you please update the instructions?