tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

NO fas.unicharset and fas.xheights file for Persian Language #60

Open AinazRafiei opened 1 month ago

AinazRafiei commented 1 month ago

There are no fas.xheights and fas.unicharset files for the Persian language. Without these files, how can we train Tesseract with LSTM on Persian? Could you please add them, or explain how we can create them?

amitdo commented 1 month ago

@stweil,

If you want to fix this issue, the fas.unicharset file can be extracted from the fas.traineddata.

IIRC, the xheights file is not needed for LSTM training.
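As a sketch of the extraction described above — assuming a local copy of fas.traineddata from the tessdata repository — combine_tessdata can unpack the traineddata components, including the LSTM unicharset:

```shell
# Unpack all components of fas.traineddata into files prefixed "fas."
# (the path is an example; point it at wherever your traineddata lives)
combine_tessdata -u fas.traineddata fas.

# The LSTM unicharset is written to fas.lstm-unicharset
ls fas.lstm-unicharset
```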

stweil commented 1 month ago

Is this an issue? There is also no eng.xheights ~and eng.unicharset~ in the repository, and the same is true for all other languages, too.

@AinazRafiei, why do you think that you need those files?

amitdo commented 1 month ago

fas.unicharset

eng.unicharset

stweil commented 1 month ago

Ah, yes, sorry. So what remains to be fixed? Or can this issue be closed?

AinazRafiei commented 1 month ago

> Is this an issue? There is also no eng.xheights ~and eng.unicharset~ in the repository, and the same is true for all other languages, too.
>
> @AinazRafiei, why do you think that you need those files?

> IIRC, the xheights file is not needed for LSTM training.

I want to train Tesseract 4 with LSTM on a custom Persian dataset — more precisely, fine-tune Tesseract on my dataset to fix errors when recognizing the text of the images in my database. I followed the training instructions: the Tesstutorial part of https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#building-the-training-tools says we need the unicharset and xheights files.

I trained Tesseract 4 without them, using only the imperfect unicharset file generated during training, and as I expected got the error "Encoding of string failed". The training error was too high because characters in my training data did not exist in the unicharset file and were ignored by the model during training, so training could not be done accurately. I believe the high training error rate is caused by the absence of unicharset and xheights files for Persian.

I don't want to train from scratch because I don't have enough data for that. There is a method called cutting off the top layer mentioned in the Tesseract documentation, but I didn't understand it. If there is any way to fine-tune in my case, could you please tell me?

AinazRafiei commented 1 month ago

fas.unicharset

eng.unicharset

The unicharset files inside the language folders (like eng and fas) are very different from the files outside those folders. The unicharset files outside are much more complete and contain many more unichars for the language, such as unichars from different fonts. The unicharsets inside the language folders were generated while training Tesseract on its own dataset for that language. They are not usable when you want to train Tesseract on your own dataset, because your dataset differs from the one Tesseract used.
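If the concern is that neither unicharset alone covers both the repository's training data and a custom dataset, one option (a sketch; file names are illustrative) is to combine them with merge_unicharsets from the training tools, so the result contains the union of both character sets:

```shell
# Merge the broad repository unicharset with one generated from your
# own data; the LAST argument is the output file
merge_unicharsets fas.unicharset fas.custom.unicharset fas.merged.unicharset
```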