tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

I get "Failed to load any lstm-specific dictionaries for lang ea!!" when predicting with fine-tuned model #252

Closed soufieneghribi closed 2 years ago

soufieneghribi commented 3 years ago

I am trying to fine-tune a model with hundreds of images. I am using the ara model as the START_MODEL and running this command: make training MODEL_NAME=ea START_MODEL=ara TESSDATA=data/ara

The new model is supposed to perform better than the base model (ara), but I am getting bad results. Example:

image

With the ara best model I get: بن عثمان بِن الهادي _

With the fine-tuned model I get: ﻦﻴﻣﺟﺟﺟﺟ_ﺎﻟﺒﻟﻤﻟﻟﺍﻣ

wrznr commented 3 years ago

Two very different things in one issue.

> Failed to load any lstm-specific dictionaries for lang ea

This is expected, since dictionaries are not carried over to the fine-tuned model (on purpose). Using dictionaries is not recommended in general for LSTM-based OCR.

> The new model is supposed to perform better than the base model

Not necessarily. This very much depends on your training and especially on the images you feed into it. Your example has a very bad line segmentation. Are you using PSM=7 or PSM=13?

soufieneghribi commented 3 years ago

> This is expected, since dictionaries are not carried over to the fine-tuned model (on purpose). Using dictionaries is not recommended in general for LSTM-based OCR.

Thank you for your response. I'm not using dictionaries. Is this acceptable as a warning? Could it affect my trained model?

> Not necessarily. This very much depends on your training and especially on the images you feed into it. Your example has a very bad line segmentation. Are you using PSM=7 or PSM=13?

I am using PSM=13 for both training and prediction.
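For readers comparing the two page segmentation modes mentioned above, here is a minimal sketch of how a single line image could be recognized with each of them. PSM 7 treats the image as a single text line, while PSM 13 treats it as a raw line and bypasses Tesseract's layout heuristics. The image name `line.png` and the model directory `data` are assumptions made for illustration; adjust them to wherever your `ea.traineddata` actually lives.

```python
import subprocess


def tesseract_command(image, lang, tessdata_dir, psm=13):
    """Build a tesseract CLI call that prints the recognized text to stdout."""
    return [
        "tesseract", str(image), "stdout",
        "--tessdata-dir", str(tessdata_dir),
        "-l", lang,
        "--psm", str(psm),
    ]


def recognize_line(image, lang, tessdata_dir, psm=13):
    """Run tesseract on one line image (needs the tesseract binary on PATH)."""
    result = subprocess.run(
        tesseract_command(image, lang, tessdata_dir, psm),
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Hypothetical paths: only prints the commands, does not run tesseract.
    for psm in (7, 13):
        print(" ".join(tesseract_command("line.png", "ea", "data", psm)))
```

Running the same image through both modes, and through both the base and the fine-tuned model, makes it easier to tell a segmentation problem apart from a model problem.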

wrznr commented 3 years ago

> Thank you for your response. I'm not using dictionaries. Is this acceptable as a warning? Could it affect my trained model?

Yes. Definitely. No worries.

soufieneghribi commented 3 years ago

I followed the documentation. I prepared 350 single-line Arabic images (xx.png) with their transcripts (xx.gt.txt) and started training with START_MODEL=ara.

make training MODEL_NAME=elda START_MODEL=ara TESSDATA=data/ara_best
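Before running a command like the one above, it may help to confirm that every line image really has a matching, non-empty transcript. The following is a rough sketch, not part of tesstrain itself; the function name `check_ground_truth` is hypothetical, and the directory passed to it is assumed to be your ground truth folder (by default tesstrain looks in `data/MODEL_NAME-ground-truth`).

```python
from pathlib import Path

# Image extensions considered here; extend if your images use another format.
IMAGE_SUFFIXES = (".png", ".tif")


def check_ground_truth(gt_dir):
    """Return (missing, empty).

    missing: image file names that have no matching .gt.txt transcript
    empty:   transcript file names that contain only whitespace
    """
    missing, empty = [], []
    for image in sorted(Path(gt_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        transcript = image.parent / (image.stem + ".gt.txt")
        if not transcript.is_file():
            missing.append(image.name)
        elif not transcript.read_text(encoding="utf-8").strip():
            empty.append(transcript.name)
    return missing, empty
```

Calling `check_ground_truth("data/elda-ground-truth")` (an assumed path based on the model name used in this thread) should return two empty lists if all pairs are in place.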

image

I am getting a 100% error rate.

aa

Is there something I am missing?
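To put the 100% figure in perspective, here is a small sketch of a character error rate based on edit distance. It is only an illustration of the general idea and is not claimed to be the exact computation behind the numbers printed during training. An error rate of 100% roughly means that as many edits are needed as there are characters in the reference, i.e. the output shares essentially nothing with the ground truth.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def char_error_rate(reference, hypothesis):
    """Edit distance divided by the reference length (can exceed 1.0)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(char_error_rate("kitten", "sitting"))  # 3 edits over 6 characters: 0.5
print(char_error_rate("abc", "xyz"))         # every character wrong: 1.0
```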

wrznr commented 3 years ago

What is ara_best? The TESSDATA parameter should point to your Tesseract model directory (e.g. /usr/local/share/tessdata in most cases).
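A quick way to check this is to confirm that the directory given as TESSDATA actually contains a file named after START_MODEL. The sketch below assumes the start model is looked up as `TESSDATA/START_MODEL.traineddata`; the helper names are hypothetical and not part of tesstrain.

```python
from pathlib import Path


def start_model_path(tessdata, start_model):
    """Path where the start model is assumed to be looked up."""
    return Path(tessdata) / f"{start_model}.traineddata"


def start_model_exists(tessdata, start_model):
    """True if TESSDATA contains START_MODEL.traineddata as a regular file."""
    return start_model_path(tessdata, start_model).is_file()


if __name__ == "__main__":
    # Values taken from the command in this thread.
    path = start_model_path("data/ara_best", "ara")
    print(path, "found" if path.is_file() else "NOT found")
```

If the file is reported as not found, the directory name or the model file name (for example a renamed `ara_best.traineddata`) is a likely culprit.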

soufieneghribi commented 3 years ago

Yes, it is the ara best model. I put it in the repo under data/ara.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.