tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

Missed letter in the hye.traineddata #49

Open reneclais opened 10 months ago

reneclais commented 10 months ago

The hye.traineddata model does not include the letter և; it is recognized as the letter ն instead. The two letters look very similar, but they do not have the same meaning. The old arm.traineddata does not have this problem.
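One way to confirm a report like this is to look for the character in the model's unicharset. The following is a minimal sketch, not the reporter's method. It assumes the unicharset has already been unpacked from the model (for example with `combine_tessdata -u`, which for an LSTM model should produce a file such as `hye.lstm-unicharset`) and that the file follows the usual layout: an entry count on the first line, then one entry per line starting with the symbol. The function names and the sample data are hypothetical.

```python
import unicodedata


def unicharset_symbols(unicharset_text):
    """Return the set of symbols listed in a unicharset file's text.

    Assumes the first line is the entry count and every following
    line starts with the symbol, followed by a space and properties.
    """
    lines = unicharset_text.split("\n")
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def describe(ch):
    """Format a character as its code point and Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    # Made-up sample standing in for a real unpacked unicharset;
    # the property fields after each symbol are placeholders.
    sample = "3\nNULL 0 x\n\u0576 3 x\n\u0561 3 x\n"
    symbols = unicharset_symbols(sample)
    for ch in ("\u0587", "\u0576"):  # the two letters from the report
        status = "present" if ch in symbols else "missing"
        print(f"{ch}  {describe(ch)}: {status}")
```

To check a real model, read the unpacked unicharset file with `encoding="utf-8"` and pass its contents to `unicharset_symbols`.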

stefan6419846 commented 10 months ago

In my opinion, this is the wrong repository for reporting this.

Nevertheless, there is no arm model in the official repositories, only an ara and an asm one. The general configuration is in the langdata and langdata_lstm repositories; the trained models are in the tessdata* repositories. Since most of the models were trained by Google, this character will probably not be fixed upstream, but you could train your own corrected model and perhaps publish it in the tessdata_contrib repository.

stweil commented 10 months ago

Armenian.traineddata contains the missing character, so I suggest trying that model.
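A small sketch of how the two models could be compared on the same image using the `tesseract` command line. It assumes `tesseract` is installed and that both `hye.traineddata` and `Armenian.traineddata` sit directly in the tessdata directory; if the script model is kept in a `script/` subfolder, the language name would be `script/Armenian` instead. The helper name and `page.png` are hypothetical.

```python
import shlex


def build_tesseract_command(image_path, lang, tessdata_dir=None):
    """Build an argument list that prints recognized text to stdout."""
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if tessdata_dir is not None:
        # Optional: point tesseract at a specific model directory.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


if __name__ == "__main__":
    # Print the commands only; nothing is executed here.
    for lang in ("hye", "Armenian"):
        print(shlex.join(build_tesseract_command("page.png", lang)))
```

Each list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`, after which the two outputs can be searched for և.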

stweil commented 10 months ago

I'll transfer this issue from tesstrain to langdata_lstm where it fits better.