tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

script to plot CER from training logfile #203

Closed Shreeshrii closed 3 years ago

Shreeshrii commented 3 years ago

As suggested in https://github.com/tesseract-ocr/tesstrain/issues/200#issuecomment-719579821

Shreeshrii commented 3 years ago

The CER data was extracted from the training logs as follows:

grep 'Eval Char' /home/ubuntu/tess5training-iast/LAYER.log | sed -e 's/^.*[0-9]At iteration //' | sed -e 's/,.* Eval Char error rate=/\t/' | sed -e 's/, Word.*$//' | sed -e 's/^/\t\t/' > plot-eval.txt
grep 'best model' /home/ubuntu/tess5training-iast/LAYER.log | sed -e 's/^.*\///' | sed -e 's/\.checkpoint.*$//' | sed -e 's/_/\t/g' | sed -e 's/\(.*\)\t\(.*\)/\1/' > plot-best.txt
grep 'At iteration' /home/ubuntu/tess5training-iast/LAYER.log | sed -e '/^Sub/d' | sed -e '/^Update/d' | sed -e 's/At iteration \([0-9]*\).*char train=/\t\t\1\t\t/' | sed -e 's/%, word.*$//' > plot-iteration.txt
sed 'N;s/\nAt iteration 0, stage 0, /At iteration 0, stage 0, /;P;D' /home/ubuntu/tess5training-iast/CHECKeval.test.log | grep 'Eval Char' | sed -e 's/.checkpoint.*Eval Char error rate=/\t\t\t/' | sed -e 's/, Word.*$//' | sed -e 's/\(^.*\)_\(.*\)_\(.*\)\t/\1\t\t\2\t\t\t/g' > plot-validation.txt
cat plot-header.txt plot-validation.txt  plot-best.txt plot-eval.txt plot-iteration.txt > plot_cer.csv
python plot_cer.py
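
For anyone who would rather do the extraction in Python, here is a minimal sketch of the first pipeline (the one that writes plot-eval.txt). It is not the code in plot_cer.py. The function name `parse_eval_cer` and the sample lines are hypothetical; the samples are modeled only on the patterns the sed expressions above match, so adjust the regex if your log differs.

```python
import re

# Same idea as the sed chain for plot-eval.txt: keep the iteration number that
# follows "At iteration " and the value that follows "Eval Char error rate=".
EVAL_RE = re.compile(r"At iteration (\d+).*Eval Char error rate=([0-9.]+)")


def parse_eval_cer(lines):
    """Return (iteration, eval CER) pairs for lines that report an eval CER."""
    pairs = []
    for line in lines:
        match = EVAL_RE.search(line)
        if match:
            pairs.append((int(match.group(1)), float(match.group(2))))
    return pairs


# Hypothetical sample lines (assumed format, not copied from a real log).
sample = [
    "At iteration 1200, stage 0, Eval Char error rate=5.25, Word error rate=20.5",
    "At iteration 1300/1300/1300, char train=4.1%, word train=12.0%",
]

if __name__ == "__main__":
    for iteration, cer in parse_eval_cer(sample):
        print(f"{iteration}\t{cer}")
```

Lines without an eval CER, such as the training progress line in the sample, are skipped, which is what the `grep 'Eval Char'` step does in the shell version.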

Attachments: plot_cer.csv.txt, plot_cer (plot image)

Shreeshrii commented 3 years ago

@kba I have added instructions for running this to the README.

|tee -a LAYER.log does not capture the output from lstmtraining. I use nohup for capturing all output.
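
A likely reason, stated here as an assumption and not verified in this thread, is that lstmtraining writes its progress messages to stderr, while a plain pipe only forwards stdout. The sketch below shows that general behavior with a stand-in child process (hypothetical, it is not lstmtraining): stderr text only ends up in the captured output when it is merged into stdout.

```python
import subprocess
import sys

# Stand-in for a tool that logs to stderr (hypothetical; not lstmtraining).
CHILD = [
    sys.executable,
    "-c",
    "import sys; print('to stdout'); print('to stderr', file=sys.stderr)",
]


def capture(merge_stderr):
    """Run the child and return what a pipe reading its stdout would receive."""
    result = subprocess.run(
        CHILD,
        stdout=subprocess.PIPE,
        # Merged: stderr is redirected into the same pipe as stdout.
        # Not merged: stderr is discarded here to keep the demo quiet; with a
        # plain "| tee" it would go to the terminal, not into the log file.
        stderr=subprocess.STDOUT if merge_stderr else subprocess.DEVNULL,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print("stdout only:", repr(capture(merge_stderr=False)))
    print("merged:     ", repr(capture(merge_stderr=True)))
```

If that assumption holds, redirecting stderr into stdout before the pipe would be an alternative to nohup for getting the messages into the log file.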

LAYER training is NOT currently supported by the Makefile. I used the log file from an independently run "replace top layer" training.

I used the sample set provided in this repo and ran training from scratch as well as with START-MODEL. Those plots look a bit different. See below.

Attached plot: ocrd-plot_cer