tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

obscure OCR after fine tuning? #222

Closed: jbarth-ubhd closed this issue 3 years ago

jbarth-ubhd commented 3 years ago

Dear reader, I'm trying to improve deu.traineddata for the Bembo font. I tried with tesseract-4.1.1 and with current master, and did the following:

text2image --fonts_dir /tmp/Bembo-Std/ --list_available_fonts

cd tesseract && src/training/tesstrain.sh --exposures "-1 0 1" --fonts_dir /tmp/Bembo-Std \
  --fontlist "Bembo Std" "Bembo Std Bold" "Bembo Std Bold Italic" \
   "Bembo Std Italic" "Bembo Std Semi-Bold" \
   "Bembo Std Semi-Bold Italic" --lang deu --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir /usr/local/share/tessdata --output_dir ~/tesstutorial/deutrain

mkdir -p ~/tesstutorial/bembo_from_full
combine_tessdata -e tessdata/best/deu.traineddata \
  ~/tesstutorial/bembo_from_full/deu.lstm

lstmtraining --model_output ~/tesstutorial/bembo_from_full/bembo --max_image_MB 60000 \
  --continue_from ~/tesstutorial/bembo_from_full/deu.lstm \
  --traineddata tessdata/best/deu.traineddata \
  --train_listfile ~/tesstutorial/deutrain/deu.training_files.txt \
  --max_iterations 400 2>&1 | tee lstmtraining.log

lstmtraining --stop_training \
  --continue_from ~/tesstutorial/bembo_from_full/bembo_checkpoint \
  --traineddata ~/tesstutorial/deutrain/deu/deu.traineddata \
  --model_output ~/tesstutorial/bembo_from_full/deu.traineddata

OCR before training with tessdata/best/deu.traineddata:

Die Schrift ist schr sorgfältig und gekonnt ausgeführt, sie zeigt einen starken Hang zu ornamentaler Ge-
staltung, der auch bei der Wappenritzung ganz deutlich hervortritt.

OCR after training with tesstutorial/bembo_from_full/deu.traineddata:

o8( E:°ä8*” 87” 7:°ä 7+ä_*ö\”8_ q>< _($+>>” !q7_(*Z°ä”t 78( @(8_” (8>(> 7”!ä$(> «!>_ @q +ä>!)(>”!\(ä d(Ö
7”!\”q>_t <(ä !q:° £(8 <(ä n!%%(>ä8”@q>_ _!>@ <(q”\8:° °(ä“+ä”ä8””2

jbarth-ubhd commented 3 years ago

ah... > is n, ( is e...

Perhaps I could build some machine learning algorithm to decipher this.

kba commented 3 years ago

That looks a bit like rot13 but not quite ;) I suspect the unicharset got corrupted when combining the traineddata files.

Since you aren't actually using tesstrain but tesseract's training tools directly, you might want to ask on the tesseract mailing list.

jbarth-ubhd commented 3 years ago

Is font "fine tuning" possible with tesstrain.sh and nothing else?

stweil commented 3 years ago

Fine tuning uses lstmtraining, which is called by the old tesstrain.sh and its replacement tesstrain.py (both part of the tesseract repository) as well as by the Makefile of the tesstrain repository. All of them support fine tuning, and of course you can write your own wrappers.

I usually use the tesstrain Makefile for fine tuning.

Shreeshrii commented 3 years ago

Output looks similar to that reported in Tesseract issue https://github.com/tesseract-ocr/tesseract/issues/1603

kba commented 3 years ago

tesseract-ocr/tesseract#1603 looks like real OCR issues, whereas the output by @jbarth-ubhd has a 1:1 mapping to the correct prediction, i.e. 8 is consistently i etc. This could probably be fixed with a tr(1) call, so I think the character set got jumbled in some weird way.
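
To illustrate the tr(1) idea, here is a minimal sketch that undoes a few of the substitutions visible in the output above. The mapping covers only six ASCII pairs read off the thread (q is u, > is n, < is d, 7 is s, 8 is i, ( is e), and the function name `decode` is made up for this example. GNU tr works on bytes, so the non-ASCII symbols in the garbled text (ä, °, ” and so on) would need a different tool.

```shell
#!/bin/sh
# Undo a 1:1 character substitution with tr(1).
# Only a partial, ASCII-only mapping is used here (an assumption for
# illustration): q->u, >->n, <->d, 7->s, 8->i, (->e.
decode() {
  tr 'q><78(' 'undsie'
}

# Two words taken from the garbled output in this thread.
printf '%s\n' 'q>< 78(' | decode   # prints: und sie
```

This only papers over the symptom; the model file itself still carries the wrong character set.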

stweil commented 3 years ago

That could be fixed by replacing the unicharset component in the model file. It looks like a wrong unicharset was used somewhere in Jochen's workflow.

Shreeshrii commented 3 years ago

Thanks for the hint, @stweil. The following should fix it (I haven't tested it, though):

lstmtraining --stop_training \
  --continue_from ~/tesstutorial/bembo_from_full/bembo_checkpoint \
  --traineddata tessdata/best/deu.traineddata \
  --model_output ~/tesstutorial/bembo_from_full/deu.traineddata

--traineddata should be the same file that was passed to the lstmtraining training command, because the unicharset is taken from it.
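
One way to check for this kind of mismatch before running --stop_training is to unpack both traineddata files and compare their unicharsets. This is a rough sketch, not a tested recipe: it assumes combine_tessdata is on PATH and that `combine_tessdata -u` writes the LSTM unicharset as `<prefix>lstm-unicharset`. The helper names `unpack_unicharset` and `same_unicharset` are made up for this example.

```shell
#!/bin/sh
# Unpack all components of a .traineddata file and print the path of
# its LSTM unicharset.
# Assumption: combine_tessdata -u names that component <prefix>lstm-unicharset.
unpack_unicharset() {
  # $1: path to a .traineddata file, $2: output directory
  mkdir -p "$2" &&
  combine_tessdata -u "$1" "$2/model." >/dev/null &&
  printf '%s\n' "$2/model.lstm-unicharset"
}

# Succeeds only if the two unicharset files are byte-identical.
same_unicharset() {
  cmp -s "$1" "$2"
}

# Example with the paths used in this thread (output directories are arbitrary):
#   a=$(unpack_unicharset tessdata/best/deu.traineddata /tmp/uc_best)
#   b=$(unpack_unicharset ~/tesstutorial/deutrain/deu/deu.traineddata /tmp/uc_train)
#   same_unicharset "$a" "$b" || echo "unicharsets differ: do not mix these files"
```

If the two files differ, the checkpoint should be converted with the traineddata it was trained against.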

@kba I had missed the one-to-one mapping in the output.

jbarth-ubhd commented 3 years ago

Thanks, that helps! PS: I copied the commands from https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00 with these modifications: impact -> bembo, eng -> deu.

Shreeshrii commented 3 years ago

The documentation will need to be updated.

Shreeshrii commented 3 years ago

@jbarth-ubhd

You can create single-line png and gt.txt files from lstmf files using the unpack feature (see https://github.com/tesseract-ocr/tesseract/issues/2669#issuecomment-569782835) from @stweil's repo. Those files can then be used as input for tesstrain.
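
Before handing such files to tesstrain, it may help to confirm that every line image has a transcription. The sketch below assumes plain `NAME.png` images paired with `NAME.gt.txt` files in one directory. The helper name `missing_gt` and the directory `data/bembo-ground-truth` are hypothetical, and the make variables in the usage comment are the ones I would expect from the tesstrain README, so please check them against your checkout.

```shell
#!/bin/sh
# Print every line image in a ground-truth directory that has no
# non-empty matching transcription.
# Assumption: images are NAME.png and transcriptions are NAME.gt.txt.
missing_gt() {
  dir=$1
  for img in "$dir"/*.png; do
    [ -e "$img" ] || continue            # glob matched nothing
    gt="${img%.png}.gt.txt"
    [ -s "$gt" ] || printf '%s\n' "$img"
  done
}

# Hypothetical usage before a fine-tuning run with the tesstrain Makefile:
#   missing_gt data/bembo-ground-truth
#   make training MODEL_NAME=bembo START_MODEL=deu TESSDATA=tessdata/best \
#        GROUND_TRUTH_DIR=data/bembo-ground-truth MAX_ITERATIONS=400
```

An empty result from `missing_gt` means every image found has a transcription next to it.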