tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

What is the impact of the redundancy of tokens #240

Closed: akmalkadi closed 3 years ago

akmalkadi commented 3 years ago

Greetings,

Many tokens are repeated in my dataset (pairs of text lines and images of those lines). Sometimes a word appears in more than 10k lines. Do I need only one occurrence of each token? Will a token that appears more than 10k times be given higher priority during recognition?

wrznr commented 3 years ago

No, to both questions. You are training on line images, and the transition probabilities are estimated at the image level. It is neither useful nor necessary to try to modify the prior in your training data. You do, however, have to ensure that every character you later want to recognize appears in your training data with sufficient frequency.
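One way to check the character-coverage point is to count how often each character occurs across the ground-truth transcriptions. The following is only a minimal sketch, not part of tesstrain: the function names are hypothetical, it assumes one transcription per `*.gt.txt` file in a single directory, and the threshold passed as `min_count` is an arbitrary choice, since the thread does not say what counts as "sufficient".

```python
from collections import Counter
from pathlib import Path


def count_characters(lines):
    """Count how often each non-whitespace character occurs across the lines."""
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if not ch.isspace())
    return counts


def rare_characters(counts, min_count):
    """Return (character, count) pairs seen fewer than min_count times, rarest first."""
    rare = [(ch, n) for ch, n in counts.items() if n < min_count]
    return sorted(rare, key=lambda item: (item[1], item[0]))


def read_ground_truth(directory, pattern="*.gt.txt"):
    """Yield the text of each ground-truth file (assumed layout: one line per file)."""
    for path in sorted(Path(directory).glob(pattern)):
        yield path.read_text(encoding="utf-8").strip()
```

Calling `rare_characters(count_characters(read_ground_truth(some_dir)), min_count=20)` would list the characters that may be underrepresented. Repeated words are not filtered out here, in line with the advice above to leave the token distribution alone.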