tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

How to train these files to get .traineddata #40

Closed josef821 closed 2 years ago

josef821 commented 3 years ago

Hi. I want to train the fin language on a new font and on character images, including characters with noise and at an angle. How can I use these files to get .traineddata files like the ones in tessdata_best: desired_characters, fin.numbers, fin.punc, fin.singles_text, fin.training_text, fin.unicharambigs, fin.unicharset, fin.wordlist, okfonts.txt?

Should I use tesstrain ( https://github.com/tesseract-ocr/tesstrain ), or use text2image to create box files and then train ( https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html )?

Shreeshrii commented 3 years ago

The tesstrain repo is useful when you have scanned line images and their ground-truth transcriptions.

To train from text and fonts instead, use text2image and the lstm.train config to create .lstmf files (the tesstrain.sh bash script does this for you). You will need to run lstmtraining after that.
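As a rough sketch of the sequence described above, the Python snippet below only builds and prints the commands; it does not run them. The font name, directories, output base, and iteration count are hypothetical placeholders, and the choice of `--psm 6` and `--exposure` values is an assumption, not something stated in this thread. The tools and flags themselves (text2image, tesseract with the lstm.train config, combine_tessdata -e, lstmtraining) come from the Tesseract training tools.

```python
import shlex

# Hypothetical names for illustration only; substitute your own paths and font.
FONT = "Example Font"
BASE = "train/fas.Example_Font.exp0"


def render_cmd(text_file, font, fonts_dir, outputbase, exposure=0):
    """text2image: render training text to <outputbase>.tif and <outputbase>.box."""
    return [
        "text2image",
        f"--text={text_file}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
        f"--outputbase={outputbase}",
        f"--exposure={exposure}",
    ]


def lstmf_cmd(outputbase):
    """tesseract with the lstm.train config: write <outputbase>.lstmf."""
    return ["tesseract", f"{outputbase}.tif", outputbase, "--psm", "6", "lstm.train"]


def extract_lstm_cmd(traineddata, lstm_out):
    """combine_tessdata -e: pull the LSTM model out of an existing traineddata."""
    return ["combine_tessdata", "-e", traineddata, lstm_out]


def finetune_cmd(start_lstm, traineddata, list_file, model_output, max_iterations):
    """lstmtraining: continue training from the extracted model."""
    return [
        "lstmtraining",
        "--continue_from", start_lstm,
        "--traineddata", traineddata,
        "--train_listfile", list_file,
        "--model_output", model_output,
        "--max_iterations", str(max_iterations),
    ]


def finalize_cmd(checkpoint, traineddata, model_output):
    """lstmtraining --stop_training: convert a checkpoint into a .traineddata file."""
    return [
        "lstmtraining",
        "--stop_training",
        "--continue_from", checkpoint,
        "--traineddata", traineddata,
        "--model_output", model_output,
    ]


def build_pipeline():
    """Return the commands in the order they would be run (all paths are placeholders)."""
    return [
        render_cmd("langdata/fas/fas.training_text", FONT, "fonts", BASE),
        lstmf_cmd(BASE),
        extract_lstm_cmd("tessdata_best/fas.traineddata", "start/fas.lstm"),
        finetune_cmd("start/fas.lstm", "tessdata_best/fas.traineddata",
                     "train/list.train", "out/fas_ft", 400),
        finalize_cmd("out/fas_ft_checkpoint", "tessdata_best/fas.traineddata",
                     "out/fas_ft.traineddata"),
    ]


if __name__ == "__main__":
    for cmd in build_pipeline():
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check paths before anything is rendered or trained. The file named by `--train_listfile` is a plain text file with one .lstmf path per line.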

josef821 commented 3 years ago

I want to add some new fonts to the fas tessdata_best model. Which do you prefer: creating ground truth and using tesstrain, or using text2image and lstmtraining? Should fonts be used randomly during training, or should I train each font separately and combine the output files at the end?

Shreeshrii commented 3 years ago

> Should fonts be used randomly during training

Yes.
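One way to mix fonts randomly is to shuffle the list of .lstmf files from all fonts before handing it to lstmtraining. The snippet below is a minimal sketch of that idea; the function name, the `list.train` / `list.eval` file names, the 10% evaluation split, and the fixed seed are assumptions chosen for illustration.

```python
import random
import tempfile
from pathlib import Path


def write_shuffled_lists(lstmf_paths, out_dir, eval_ratio=0.1, seed=0):
    """Shuffle .lstmf paths from all fonts together and write train/eval list files.

    Each list file has one path per line, the format expected by
    lstmtraining --train_listfile / --eval_listfile.
    """
    paths = sorted(str(p) for p in lstmf_paths)  # sort first so the seed is reproducible
    random.Random(seed).shuffle(paths)           # interleave samples from every font

    n_eval = int(len(paths) * eval_ratio)
    eval_paths, train_paths = paths[:n_eval], paths[n_eval:]

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "list.train").write_text("".join(p + "\n" for p in train_paths))
    (out_dir / "list.eval").write_text("".join(p + "\n" for p in eval_paths))
    return train_paths, eval_paths


if __name__ == "__main__":
    # Hypothetical file names: two fonts, ten rendered pages each.
    samples = [f"train/fas.FontA.exp{i}.lstmf" for i in range(10)]
    samples += [f"train/fas.FontB.exp{i}.lstmf" for i in range(10)]
    with tempfile.TemporaryDirectory() as tmp:
        train, evaluation = write_shuffled_lists(samples, tmp)
        print(len(train), "training files,", len(evaluation), "evaluation files")
```

With a single shuffled list, one training run sees all fonts interleaved, so there are no per-font models to combine afterwards.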

> create groundtruth and use tesstrain

Yes, because it will run lstmtraining for you.
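For the tesstrain route, a small pre-flight check can help before starting a run. The sketch below assumes the tesstrain convention of line images paired with `.gt.txt` transcriptions in a ground-truth directory, and it only handles plain `.png` and `.tif` names. The model name `fas_newfonts`, the directory paths, and the iteration count are hypothetical; `MODEL_NAME`, `START_MODEL`, `TESSDATA`, and `MAX_ITERATIONS` are variables of the tesstrain Makefile.

```python
import shlex
from pathlib import Path


def missing_transcriptions(gt_dir):
    """Return names of line images in gt_dir that have no matching .gt.txt file."""
    gt_dir = Path(gt_dir)
    images = sorted(gt_dir.glob("*.png")) + sorted(gt_dir.glob("*.tif"))
    missing = []
    for image in images:
        transcription = image.parent / (image.stem + ".gt.txt")
        if not transcription.exists():
            missing.append(image.name)
    return missing


def tesstrain_cmd(model_name, start_model, tessdata_dir, max_iterations):
    """Build the tesstrain make invocation; make then runs lstmtraining itself."""
    return [
        "make", "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata_dir}",
        f"MAX_ITERATIONS={max_iterations}",
    ]


if __name__ == "__main__":
    # Hypothetical model name and paths, for illustration only.
    gt_dir = Path("data/fas_newfonts-ground-truth")
    if gt_dir.is_dir():
        for name in missing_transcriptions(gt_dir):
            print("no transcription for", name)
    print(shlex.join(tesstrain_cmd("fas_newfonts", "fas", "tessdata_best", 10000)))
```

Starting from an existing model with `START_MODEL` is the fine-tuning case discussed in this thread, where new fonts are added to fas rather than training from scratch.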