tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Error make: *** No rule to make target 'data/ben-ground-truth/20-019.lstmf', needed by 'data/ben/all-lstmf' during training #220

Closed: srdg closed this issue 3 years ago

srdg commented 3 years ago

I have been trying to fine-tune the existing ben.traineddata model to suit my use-case but somehow this particular error keeps popping up:

 find data/ben-ground-truth -name '*.gt.txt' | xargs cat | sort | uniq > "data/ben/all-gt"
unicharset_extractor --output_unicharset "data/ben/unicharset" --norm_mode 2 "data/ben/all-gt"
Bad box coordinates in boxfile string! 
Extracting unicharset from plain text file data/ben/all-gt
Wrote unicharset file data/ben/unicharset
make: *** No rule to make target 'data/ben-ground-truth/20-019.lstmf', needed by 'data/ben/all-lstmf'.  Stop.

I ran the training in a Google Colab environment.
This bash script shows the steps I used to train on this data.
From what I found by searching, this error comes from the Makefile, but I am not sure what exactly is wrong. FYI, the file name flagged in the error changes from run to run, so it is not consistent. What can be done to fix this issue?

Shreeshrii commented 3 years ago

:~/tesstrain/data/ben-ground-truth$ ls
10-002.box       12-004.box       14-005.box       16-008.box       18-009.box       20-011.box       22-013.box       24-015.box       26-017.box       28-019.box       30-021.box
10-002.exp0.tif  12-004.exp0.tif  14-005.exp0.tif  16-008.exp0.tif  18-009.exp0.tif  20-011.exp0.tif  22-013.exp0.tif  24-015.exp0.tif  26-017.exp0.tif  28-019.exp0.tif  30-021.exp0.tif

The names of your tif and gt.txt files do not match. Fix this by removing exp0. from the tif file names.
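The rename could be scripted rather than done by hand. The following is only a sketch, assuming the images are named NAME.exp0.tif as in the listing above and should become NAME.tif; the function name strip_exp0 and its dry_run parameter are made up for illustration and are not part of tesstrain.

```python
from pathlib import Path


def strip_exp0(ground_truth_dir, dry_run=True):
    """Rename NAME.exp0.tif to NAME.tif inside ground_truth_dir.

    Returns a list of (old_name, new_name) pairs. With dry_run=True
    nothing is renamed, so the plan can be reviewed first.
    """
    renames = []
    for tif in sorted(Path(ground_truth_dir).glob("*.exp0.tif")):
        # Drop only the trailing ".exp0.tif" and put ".tif" back.
        target = tif.with_name(tif.name[: -len(".exp0.tif")] + ".tif")
        if target.exists():
            # Never overwrite an existing image; leave it for manual review.
            continue
        renames.append((tif.name, target.name))
        if not dry_run:
            tif.rename(target)
    return renames
```

Calling strip_exp0("data/ben-ground-truth") first and checking the returned pairs, then calling it again with dry_run=False, keeps the rename reviewable.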

Then train with the following (set the TESSDATA dir as per your environment).

nohup make training MODEL_NAME=ben RATIO_TRAIN=0.80 LANG_TYPE=Indic DEBUG_INTERVAL=-1 TESSDATA=$HOME/tessdata_best MAX_ITERATIONS=99999999 START_MODEL=ben > data/ben.log &

Shreeshrii commented 3 years ago

29-026.gt.txt is empty, so it produces a box file of zero size. Fix the ground truth file.
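A quick check for other empty ground truth files might look like the sketch below. It assumes the files are UTF-8 text and treats a file containing only whitespace or a lone newline as empty; find_empty_ground_truth is a hypothetical helper, not a tesstrain command.

```python
from pathlib import Path


def find_empty_ground_truth(ground_truth_dir):
    """Return the names of *.gt.txt files that contain no visible text."""
    empty = []
    for gt in sorted(Path(ground_truth_dir).glob("*.gt.txt")):
        # strip() removes spaces and newlines, so a file holding only
        # "\n" is reported as empty too.
        if not gt.read_text(encoding="utf-8").strip():
            empty.append(gt.name)
    return empty
```

Running find_empty_ground_truth("data/ben-ground-truth") before make training should list any such files up front.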

It looks like you have not reviewed the gt.txt files against the line image tifs. Some of them do not seem to be aligned. Please review the log to identify the files that need fixing. Examples:

Iteration 11: GROUND TRUTH : থাকব।
Iteration 11: ALIGNED TRUTH : থথথাক।ব।।।।
Iteration 11: BEST OCR TEXT : লোভ দেখিয়ে দ্বেবী তাকে সঙ্গে রাখতে চান। সে বললে,--আচ্ছা রাজ্রে
File data/ben-ground-truth/25-010.lstmf line 0 :

Iteration 18: GROUND TRUTH : একা বনের মধ্যে ফেলে রাখার বিরুদ্ধে বিদ্রোহ তুলেছিল।
Iteration 18: ALIGNED TRUTH : একা বনের মধধ্ধধ্যে ফেলেে রাখার বিরুদ্েধ বেে বিদ্্রোহ ততুলেছিল।
Iteration 18: BEST OCR TEXT : কুটারের দিকে চেয়ে পাহারা রাখত। তার তরুণ বীর হৃদয় এক ভীরু নারীকে
File data/ben-ground-truth/25-021.lstmf line 0 :

Iteration 23: GROUND TRUTH : পাষাণ হওয়ার পর ? তা আমি পারব না।
Iteration 23: ALIGNED TRUTH : পাষাণ হ হওয়য়ার পর ? তা আ আমি পারবব নাানা।
Iteration 23: BEST OCR TEXT : কোরো না প্রদ্যুন, ভেবে দেখ, মৃত্যুর পর হয় তো পরজগৎ আছে কিনত
File data/ben-ground-truth/27-003.lstmf line 0 :

Iteration 54: GROUND TRUTH : তার কোন উপায় নেই।
Iteration 54: ALIGNED TRUTH : ততার র ককোন উপায় নেনেইেই।
Iteration 54: BEST OCR TEXT : মন্ত্রপুত অল দেবীর গায়ে ছড়িয়ে দিলে তিনি আবার মুক্ত হবেন বটে, কিন্তু
File data/ben-ground-truth/26-025.lstmf line 0 :
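To get a list of candidate files out of a long log, something like the following sketch may help. It only collects the .lstmf paths that appear after the word "File" in the log text, as in the excerpts above. Which iteration a given file line belongs to depends on the log layout, so the result is a list of files to inspect, not a verdict. lstmf_files_in_log is a hypothetical helper.

```python
import re
from collections import Counter

# Matches e.g. "File data/ben-ground-truth/25-010.lstmf line 0 :"
LSTMF_RE = re.compile(r"File (\S+\.lstmf)")


def lstmf_files_in_log(log_text):
    """Count how often each .lstmf path is mentioned in the log text."""
    return Counter(LSTMF_RE.findall(log_text))
```

For the command above, the log text could be read from data/ben.log and the most frequent entries checked first against their tif and gt.txt pairs.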

srdg commented 3 years ago

@Shreeshrii thank you for all your help! I wrote the text corpus manually and then used a Python script to split it into several gt.txt files. It looks like I had somehow deleted one line and left an empty line in its place. I am now able to run the training. Again, thank you! Closing this issue since it is resolved.
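For anyone splitting a corpus the same way, here is a minimal sketch of a splitter that skips blank lines instead of writing empty gt.txt files. The original script was not posted, so everything here is an assumption: the function name split_corpus, the prefix parameter, and the PREFIX-NNN.gt.txt naming are illustrative only. Files are numbered by source line, so a skipped blank line leaves a gap in the numbering rather than shifting later files.

```python
from pathlib import Path


def split_corpus(corpus_path, out_dir, prefix="line"):
    """Write one PREFIX-NNN.gt.txt per non-blank corpus line.

    Returns the names of the files written, in corpus order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    lines = Path(corpus_path).read_text(encoding="utf-8").splitlines()
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            # A blank line would become an empty ground truth file.
            continue
        target = out / f"{prefix}-{number:03d}.gt.txt"
        target.write_text(text + "\n", encoding="utf-8")
        written.append(target.name)
    return written
```

A gap in the returned names points at the corpus line that needs to be restored before the matching line image can be used.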