tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Does not create lstmf file: Compute CTC targets failed! #273

Closed townim-faisal closed 3 years ago

townim-faisal commented 3 years ago

I have tried to train Tesseract on this Bengali data.

I have run this command: make training MODEL_NAME=ben RATIO_TRAIN=0.90 LANG_TYPE=Indic TESSDATA=/mnt/c/T.Faisal/OFFICE/tesstrain START_MODEL=ben GROUND_TRUTH_DIR=data/ben-ground-truth DATA_DIR=data

I have used the best traineddata from here.

However, it does not generate an lstmf file in the ben-ground-truth folder for every image; for example, none is created for _1461359216_fathers_nameinfo.png in data. It also reports "Normalization failed for string". You can check the data here: ben-ground-truth.zip

And it gives the following output.

kba commented 3 years ago

It's very little data and it isn't properly segmented. IIUC you have just seven images in the data and of those, some are two lines (e.g. 1002509501_address_info.png) or even three lines (e.g. 1009113257_address_info.png).

So I suggest you get some more data and make sure that each image/ground-truth pair contains only a single line.
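As a rough aid for the check suggested above, here is a minimal Python sketch that scans a ground-truth folder and flags pairs that look wrong. It assumes the usual tesstrain layout, where each line image (.png or .tif) has a transcription with the same base name and a .gt.txt extension. The function name check_ground_truth is hypothetical, not part of tesstrain. It only inspects the text files, so a multi-line image with a one-line transcription would still slip through.

```python
from pathlib import Path

# Assumption: only plain .png/.tif images; variants such as .bin.png are not handled.
IMAGE_SUFFIXES = (".png", ".tif")


def check_ground_truth(gt_dir):
    """Return a list of (image_name, problem) tuples for suspicious pairs."""
    problems = []
    for image in sorted(Path(gt_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # Transcription shares the base name: foo.png -> foo.gt.txt
        transcript = image.parent / (image.stem + ".gt.txt")
        if not transcript.exists():
            problems.append((image.name, "missing .gt.txt"))
            continue
        text = transcript.read_text(encoding="utf-8")
        lines = [ln for ln in text.splitlines() if ln.strip()]
        if len(lines) == 0:
            problems.append((image.name, "empty transcription"))
        elif len(lines) > 1:
            problems.append((image.name, f"{len(lines)} text lines"))
    return problems


if __name__ == "__main__":
    # Folder name taken from the command in this issue.
    for name, problem in check_ground_truth("data/ben-ground-truth"):
        print(f"{name}: {problem}")
```

Any image reported here would need to be re-segmented or removed before running make training again.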

townim-faisal commented 3 years ago

Thank you for your reply @kba. Can you explain what you mean by "it isn't properly segmented"?

kba commented 3 years ago

tesstrain (and every modern trainable OCR engine) is trained on line-wise image/text ground-truth pairs. As I said, you have multi-line images in there, which should be split into individual lines. @stefanCCS had a similar issue.
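To illustrate what splitting into individual lines can mean in practice, here is a minimal sketch of one common approach, a horizontal projection profile: rows with no ink are treated as gaps between text lines. This is not something tesstrain does for you, and the function name find_line_bands is hypothetical. It assumes the image has already been binarized into rows of 0 (background) and 1 (ink). Real scans with skew, noise, or touching lines would need something sturdier, and detached marks above or below a line may be split off as separate bands.

```python
def find_line_bands(pixels, min_height=1):
    """Find vertical bands of rows that contain ink.

    pixels: list of rows, each row a list of 0 (background) or 1 (ink).
    Returns a list of (top, bottom) row indices, bottom exclusive.
    """
    bands = []
    start = None  # first row of the band currently being tracked
    for y, row in enumerate(pixels):
        has_ink = any(row)
        if has_ink and start is None:
            start = y  # a new text line begins
        elif not has_ink and start is not None:
            if y - start >= min_height:
                bands.append((start, y))  # the text line ended on the previous row
            start = None
    # Close a band that runs to the bottom edge of the image.
    if start is not None and len(pixels) - start >= min_height:
        bands.append((start, len(pixels)))
    return bands


# Tiny two-line example: ink in rows 1-2 and row 5.
example = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(find_line_bands(example))  # [(1, 3), (5, 6)]
```

Each band would then be cropped into its own image and paired with the matching single line of the transcription.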

townim-faisal commented 3 years ago

Thanks @kba for your help. I am closing this issue.