townim-faisal closed this issue 3 years ago
It's very little data and it isn't properly segmented. IIUC you have just seven images in the data and of those, some are two lines (e.g. 1002509501_address_info.png) or even three lines (e.g. 1009113257_address_info.png).
So I suggest you get some more data and make sure that the image-ground-truth pairs are only a single line.
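A quick way to check the pairs before training is a small audit script. This is only a sketch, not part of tesstrain: the function name `audit_ground_truth` and the report keys are made up for illustration, and it assumes plain `.png`/`.tif` images whose transcription sits next to them as `<name>.gt.txt`, with tesstrain writing `<name>.lstmf` alongside.

```python
from pathlib import Path
from typing import Dict, List

# Assumption: only plain .png/.tif line images are used.
IMAGE_SUFFIXES = {".png", ".tif"}


def audit_ground_truth(gt_dir: str) -> Dict[str, List[str]]:
    """List images whose ground truth looks unusable for line-wise training.

    Hypothetical helper: it flags images with no .gt.txt, with a .gt.txt
    that does not hold exactly one non-empty line, or with no .lstmf yet.
    """
    report: Dict[str, List[str]] = {
        "missing_gt_txt": [],
        "not_single_line": [],
        "missing_lstmf": [],
    }
    for image in sorted(Path(gt_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        gt_file = image.parent / (image.stem + ".gt.txt")
        lstmf_file = image.parent / (image.stem + ".lstmf")

        if not gt_file.exists():
            report["missing_gt_txt"].append(image.name)
        else:
            text = gt_file.read_text(encoding="utf-8")
            lines = [line for line in text.splitlines() if line.strip()]
            if len(lines) != 1:
                report["not_single_line"].append(image.name)

        if not lstmf_file.exists():
            report["missing_lstmf"].append(image.name)
    return report


if __name__ == "__main__":
    # Directory name taken from the make command in this issue.
    for problem, names in audit_ground_truth("data/ben-ground-truth").items():
        print(problem, names)
```

Anything listed under `not_single_line` is a candidate for splitting into separate line images. Note that this only inspects the text files; it cannot tell whether the image itself shows more than one line.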
Thank you for your reply @kba. Can you explain what you mean by "it isn't properly segmented"?
tesstrain (and every modern trainable OCR engine) is trained on line-wise image/text ground truth. As I said, you have multi-line images in there, and those should be split into individual lines, each with its own transcription. @stefanCCS had a similar issue.
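To illustrate what splitting into lines means, here is a minimal sketch of one common approach, a horizontal projection profile: rows that contain no ink are treated as gaps between text lines. It is not what tesstrain does internally. The names `find_line_bands` and `split_lines` are hypothetical, and the image is assumed to be already binarized into a list of rows with 1 for ink and 0 for background.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed binarized: 1 = ink, 0 = background


def find_line_bands(image: Image, min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges of text lines, bottom exclusive.

    Bands separated by fewer than `min_gap` blank rows are merged, so
    that detached marks above or below a line stay attached to it.
    """
    bands: List[Tuple[int, int]] = []
    start = None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new band begins
        elif not has_ink and start is not None:
            bands.append((start, y))       # the band ended on the previous row
            start = None
    if start is not None:
        bands.append((start, len(image)))  # band runs to the bottom edge

    merged: List[Tuple[int, int]] = []
    for top, bottom in bands:
        if merged and top - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], bottom)
        else:
            merged.append((top, bottom))
    return merged


def split_lines(image: Image, min_gap: int = 1) -> List[Image]:
    """Crop the image into one sub-image per detected text line."""
    return [image[top:bottom] for top, bottom in find_line_bands(image, min_gap)]


if __name__ == "__main__":
    page = [
        [0, 0, 0],
        [1, 1, 0],
        [0, 1, 0],
        [0, 0, 0],
        [0, 0, 0],
        [1, 0, 1],
        [0, 0, 0],
    ]
    print(find_line_bands(page))  # [(1, 3), (5, 6)]
```

Each crop then needs its own single-line `.gt.txt`. For scripts such as Bengali, where marks can sit above or below the main line, `min_gap` would probably need tuning, and real scans with noise or skew generally need more than this.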
Thanks @kba for your help. I am closing this issue.
I have tried to train on this Bengali data.
I have run this command:
make training MODEL_NAME=ben RATIO_TRAIN=0.90 LANG_TYPE=Indic TESSDATA=/mnt/c/T.Faisal/OFFICE/tesstrain START_MODEL=ben GROUND_TRUTH_DIR=data/ben-ground-truth DATA_DIR=data
I have used the best traineddata from here.
But it does not generate an lstmf file in the ben-ground-truth folder for every image; one example is _1461359216_fathers_nameinfo.png in data. It also gives "Normalization failed for string". You can check here: ben-ground-truth.zip
And it gives the following output.