Closed: nikhilcms closed this issue 2 years ago
Please have a look at combine_tessdata
and its documentation. That program can list the components contained in a traineddata file. You will see that the newly trained model file is missing some components of the original file, for example the dictionary, which is why it is significantly smaller. combine_tessdata can also add such components back, so you can try restoring the missing components and see whether that has an effect on the OCR result.
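As a rough sketch, the list/unpack/restore workflow could look like the commands below. The option letters (`-d`, `-u`, `-o`) are combine_tessdata's documented modes; the file names (`eng.traineddata`, `finetuned.traineddata`) and the chosen component are placeholders for your own files:

```shell
# List the components packed into each traineddata file,
# so you can compare the original with the fine-tuned model.
combine_tessdata -d eng.traineddata
combine_tessdata -d finetuned.traineddata

# Unpack the original model into its individual components
# (creates files such as eng.lstm-word-dawg, eng.lstm-punc-dawg, ...).
combine_tessdata -u eng.traineddata eng.

# Insert a missing component (here: the word dictionary) into the
# fine-tuned model; -o overwrites/adds components in place.
combine_tessdata -o finetuned.traineddata eng.lstm-word-dawg
```

One caveat: dictionary (dawg) components are tied to the model's unicharset, so re-adding the original dawgs is only expected to work if the fine-tuned model kept the original unicharset.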
Please use the Tesseract user forum for additional questions.
Hi @stweil, thanks for your reply. At training time I added model_name.wordlist, model_name.punc, and model_name.numbers to the data/model_name directory. Training produced a 15 MB traineddata file. After unpacking both the 15 MB traineddata (my fine-tuned model) and the 24 MB traineddata (the original), I found that some files are missing from the fine-tuned model. Is it really necessary to add all those missing files to my fine-tuned model before inference?
Hi, the original eng.traineddata file is ~24 MB, but after fine-tuning it on a custom dataset (with the start model from tessdata_best) the resulting traineddata file is 11 MB. Can you please confirm whether that is fine? The results from the fine-tuned model are good.