tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0
Regarding fine-tuned traindata model size #284

Closed nikhilcms closed 2 years ago

nikhilcms commented 2 years ago

Hi, the original eng.traineddata file is about 24 MB, but after fine-tuning it on a custom dataset (the start model is from tessdata_best), the resulting traineddata file is only 11 MB. Can you please confirm whether this is fine? The results from the fine-tuned model are good.

stweil commented 2 years ago

Please have a look at combine_tessdata and its documentation. That program can list the components that are part of a traineddata file. You will see that the newly trained model file lacks some of the components of the original file, for example the dictionary, which is why it is significantly smaller. The program can also add such components again, so you can try adding all the missing components and see whether that affects the OCR result.
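The comparison described above could be scripted roughly as follows. This is only a sketch: it assumes combine_tessdata is installed and on PATH, and that its -u mode writes one file per component using the given prefix (for example eng.lstm, eng.lstm-word-dawg). The helper names unpack, component_names and missing_components are made up for illustration and are not part of Tesseract or tesstrain.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def unpack(traineddata: Path, out_dir: Path) -> None:
    """Unpack all components of a .traineddata file into out_dir.

    Runs `combine_tessdata -u <file> <prefix>`; this needs Tesseract's
    training tools to be installed (an assumption of this sketch).
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # The prefix ends with a dot, so components land as <stem>.<component>.
    prefix = str(out_dir / traineddata.stem) + "."
    subprocess.run(
        ["combine_tessdata", "-u", str(traineddata), prefix],
        check=True,
    )


def component_names(unpacked_dir: Path, model_name: str) -> set[str]:
    """Return the component suffixes found in an unpacked directory."""
    prefix = model_name + "."
    return {
        p.name[len(prefix):]
        for p in unpacked_dir.iterdir()
        if p.is_file() and p.name.startswith(prefix)
    }


def missing_components(
    original_dir: Path,
    original_name: str,
    finetuned_dir: Path,
    finetuned_name: str,
) -> list[str]:
    """List components present in the original model but not the fine-tuned one."""
    original = component_names(original_dir, original_name)
    finetuned = component_names(finetuned_dir, finetuned_name)
    return sorted(original - finetuned)
```

A possible use would be to call unpack() once on the original eng.traineddata and once on the fine-tuned file, each into its own directory, and then call missing_components() on the two directories. Any component it reports could then be added back with combine_tessdata as suggested above, and the OCR results compared before and after.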

stweil commented 2 years ago

Please use the Tesseract user forum for additional questions.

nikhilcms commented 2 years ago

[Image: 15mb_traindata]

[Image: 22mb_traindata]

Hi @stweil, thanks for your reply. At training time, I added model_name.wordlist, model_name.punc, and model_name.numbers to the data/model_name directory, and the resulting traineddata file was 15 MB. After unpacking both the 15 MB traineddata (my fine-tuned model) and the 24 MB traineddata (the original), I found that some files are still missing from the fine-tuned model. Is it really necessary to add all of those missing files to my fine-tuned model before inference?