tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Finetuning for limited number of characters #69

Closed by canyilmaz90 5 years ago

canyilmaz90 commented 5 years ago

Is it possible to fine-tune only for the characters that exist in my own character set? To be clearer: I have a dataset consisting of some Latin characters, which is a subset of the original English language data, and I don't want the other characters or symbols. Can I do that while still fine-tuning with START_MODEL? @kba @wrznr @shreeshrii

wrznr commented 5 years ago

I do not think that this is possible. You can extend the set of characters or change the probability distribution via fine-tuning, but I doubt that it is possible to exclude characters which are present in the original model you continue from.
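
Editorial sketch, not part of the original reply: if characters cannot be removed from the model itself, one possible workaround is to filter the recognized text afterwards. The example below is plain Python and does not call Tesseract; the function name `filter_to_charset` and the sample strings are hypothetical, and post-filtering is a swapped-in technique, not something the thread confirms.

```python
from typing import Iterable


def filter_to_charset(text: str, allowed: Iterable[str], replacement: str = "") -> str:
    """Drop (or replace) every character that is not in the allowed set.

    Whitespace is always kept so that word and line boundaries survive.
    This only post-processes OCR output; it does not change the model.
    """
    allowed_set = set(allowed)
    return "".join(
        ch if ch in allowed_set or ch.isspace() else replacement
        for ch in text
    )


if __name__ == "__main__":
    # Hypothetical OCR output containing symbols outside the wanted subset.
    print(filter_to_charset("AB#C 1", "ABC"))
```

Note that this removes unwanted characters rather than correcting them, so a misread letter stays lost; it is a cleanup step, not a substitute for retraining.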

canyilmaz90 commented 5 years ago

Thank you for replying. I couldn't find any whitelisting option either. Fortunately, if you have a large amount of data, the fine-tuned model excludes the unnecessary characters by itself.
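
Editorial sketch, not part of the original comment: one way to check the observation above on your own data is to count how often characters outside the wanted set still appear in the fine-tuned model's output. The function name `out_of_set_counts` and the sample lines are hypothetical; the predictions are assumed to be already available as plain strings.

```python
from collections import Counter
from typing import Iterable


def out_of_set_counts(predictions: Iterable[str], allowed: Iterable[str]) -> Counter:
    """Count non-whitespace characters in the predictions that are not allowed."""
    allowed_set = set(allowed)
    counts: Counter = Counter()
    for line in predictions:
        counts.update(
            ch for ch in line
            if ch not in allowed_set and not ch.isspace()
        )
    return counts


if __name__ == "__main__":
    # Hypothetical recognized lines; the wanted character set is just "A" and "B".
    stray = out_of_set_counts(["AB#", "A$ #"], "AB")
    for ch, n in stray.most_common():
        print(repr(ch), n)
```

An empty result on a held-out sample would support the claim for that sample; it would not prove that the characters can never be emitted.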

wrznr commented 5 years ago

Thanks again!