tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

How to do incremental-training on tesseract-ocr? #391

Open zaryabRiasat opened 1 month ago

zaryabRiasat commented 1 month ago

I'm working with tesseract-4.1.1 and trying to fine-tune a model. I have followed these steps:

  1. Downloaded eng.traineddata from tessdata_best and pasted it into /usr/share/tesseract-ocr/4.00/tessdata.

  2. Then I created image crops using craft-text-detector in Python and made a ground-truth file (.gt.txt) for each image crop.

  3. Then I ran git clone https://github.com/tesseract-ocr/ocrd-train.git and then cd ocrd-train.

  4. Inside the ocrd-train/data folder, I created a my-model-ground-truth folder and pasted the .png and .gt.txt files into it.

  5. Then I ran the command make tesseract-langdata in the terminal.

  6. Finally, I ran the command make training MODEL_NAME=my-model MAX_ITERATIONS=20000 PSM=7 FINETUNE_TYPE=Impact DEBUG_INTERVAL=-1 START_MODEL=eng TESSDATA=/usr/share/tesseract-ocr/4.00/tessdata/

The above procedure took some time, and I got a my-model.traineddata file in ocrd-train/data/. I pasted that file into /usr/share/tesseract-ocr/4.00/tessdata, and it gives better results than eng.traineddata.
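A quick sanity check before running make training is to confirm that every image crop has a matching transcription. The sketch below assumes the layout described in step 4 (name.png next to name.gt.txt in one folder, plain .png files only); the function name missing_ground_truth is hypothetical and not part of the repository.

```shell
#!/bin/sh
# Hypothetical helper: print every .png in a ground-truth folder
# that has no matching .gt.txt file next to it.
missing_ground_truth() {
    dir=$1
    for img in "$dir"/*.png; do
        # If the glob matched nothing, the pattern itself comes back; skip it.
        [ -e "$img" ] || continue
        gt="${img%.png}.gt.txt"
        [ -f "$gt" ] || printf '%s\n' "$img"
    done
}

# Example call (path taken from step 4):
# missing_ground_truth data/my-model-ground-truth
```

An empty result means every crop has its ground-truth file; any path printed is an image that still needs a .gt.txt.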

For the above training I used 20 images. Now I want to do incremental training: I want to train the previously trained my-model.traineddata on 30 more images. I'm confused here because, after the previous training completed, there are several folders and one file in ocrd-train/data/:

  1. my-model (folder)

  2. my-model-ground-truth (folder)

  3. eng (folder)

  4. langdata (folder)

  5. my-model.traineddata (file)

Now what should I do for incremental training?

Do I only need to remove the files in my-model-ground-truth, paste in the new .png and .gt.txt files for the 30 images, and use my-model as START_MODEL?

Or do I need to remove the other folders as well?
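One way the follow-up run could look, offered as an untested sketch rather than a confirmed answer: reuse the make training command from step 6, point START_MODEL at the earlier result, and give the new run its own MODEL_NAME with its own ground-truth folder. The name my-model-v2 and the helper next_round_command are hypothetical. The sketch assumes my-model.traineddata is already in the TESSDATA directory (as described above) and that a separate model name avoids clashing with the existing data/my-model folder. The function only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Hypothetical helper: build the make command for a follow-up fine-tuning run.
# $1 = model to start from (its .traineddata is assumed to be in TESSDATA)
# $2 = name for the new model; ground truth is expected in data/$2-ground-truth
# $3 = optional TESSDATA directory (defaults to the path used in step 6)
next_round_command() {
    start=$1
    new=$2
    tessdata=${3:-/usr/share/tesseract-ocr/4.00/tessdata/}
    printf 'make training MODEL_NAME=%s START_MODEL=%s TESSDATA=%s MAX_ITERATIONS=20000 PSM=7 FINETUNE_TYPE=Impact DEBUG_INTERVAL=-1\n' \
        "$new" "$start" "$tessdata"
}

next_round_command my-model my-model-v2
```

Placing the 30 new pairs in data/my-model-v2-ground-truth would leave the first run's folders untouched. Whether to also copy the original 20 pairs into that folder is a judgment call: training on the new images alone may shift the model away from what it learned in the first round.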

stweil commented 1 month ago

Are you using very old instructions (old Tesseract release, old repository URL, ...)?

zaryabRiasat commented 1 month ago

@stweil Thank you for your response.

Yes, I'm using tesseract-4.1.1 and the old repository.

The first training run works fine with START_MODEL=eng, but I am unable to do incremental training as described above.

zaryabRiasat commented 1 month ago

@stweil I just want to know how I can do incremental training on my existing trained model.

What steps should I follow?

zdenop commented 1 month ago

What about reading Tesseract documentation and Readme of this repository?

stweil commented 1 month ago

@zaryabRiasat, the first step is using a recent software release instead of an old one and also reading the current documentation.