tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

LSTM: character set vs Script unicharset vs Training text unicharset #33

Closed: Shreeshrii closed this issue 7 years ago

Shreeshrii commented 7 years ago

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 says:

The lstmtraining program is a multi-purpose tool for training neural networks. The following table describes its command-line options:

Flag    Type      Default    Explanation
U       string    none       Path to the unicharset for the character set.

and also

Fine tuning is the process of training an existing model on new data without changing anything else like the character set or any part of the network. Doesn't need a unicharset, script_dir, or net_spec, as they all come from the existing model.
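To make the "without changing the character set" condition concrete, here is a minimal sketch that checks whether new training text stays inside the set of characters an existing model already covers. It is not Tesseract's own unicharset handling: `char_set`, `new_characters`, and the sample strings are hypothetical, and a real unicharset file holds more than a bare list of code points.

```python
from __future__ import annotations


def char_set(text: str) -> set[str]:
    """Return the distinct non-whitespace characters in text."""
    return {ch for ch in text if not ch.isspace()}


def new_characters(model_chars: set[str], training_text: str) -> set[str]:
    """Characters in the training text that the existing model does not cover."""
    return char_set(training_text) - model_chars


# Hypothetical data: a model that only knows lowercase ASCII letters,
# and training text that contains two accented letters.
model_chars = char_set("abcdefghijklmnopqrstuvwxyz")
missing = new_characters(model_chars, "na\u00efve caf\u00e9")

if missing:
    print("Character set would change; new characters:", sorted(missing))
else:
    print("Character set unchanged; plain fine tuning may be enough.")
```

If `missing` is empty, the quoted description of fine tuning applies as written; otherwise the character set itself has to change.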

and

Fine tuning is OK if you don't want to change the character set, but what if you want to train for Klingon? You are unlikely to have much training data and it is unlike anything else, so what do you do? You can try removing some of the top layers of an existing network model, replace some of them with new randomized layers, and train with your data. The command-line is mostly the same as Training from scratch, as you have to supply a unicharset and net_spec, and you also have to provide a model to --continue_from and --append_index.
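The difference between the two modes quoted above can be sketched as argument lists. This is an illustration under assumptions, not a working invocation: only `--continue_from` and `--append_index` are spelled with dashes in the quoted text, so the `--U` and `--net_spec` spellings are assumed, all paths and values are placeholders, and other options a real run needs (output location, training file list) are left out because the quoted passages do not name them.

```python
from __future__ import annotations


def fine_tune_args(model: str) -> list[str]:
    """Fine tuning: unicharset and net_spec come from the existing model."""
    return ["lstmtraining", "--continue_from", model]


def replace_layers_args(
    model: str, unicharset: str, net_spec: str, append_index: int
) -> list[str]:
    """Replacing top layers: supply a unicharset and net_spec, as when
    training from scratch, plus the model and the index to append at."""
    return [
        "lstmtraining",
        "--U", unicharset,          # assumed spelling of the U flag
        "--net_spec", net_spec,     # assumed spelling
        "--continue_from", model,
        "--append_index", str(append_index),
    ]


# Placeholder values, for illustration only.
print(fine_tune_args("existing.lstm"))
print(replace_layers_args("existing.lstm", "new.unicharset", "[Lfx256 O1c105]", 5))
```

The second list is where the question in this issue matters: the unicharset passed there defines the character set of the new output layer, so it has to cover the new training text.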

@theraysmith Ray, please clarify which character set is being referred to here. Thanks!

theraysmith commented 7 years ago

Fix on the way.