tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Makefile based plotting #236

Closed Shreeshrii closed 2 years ago

M3ssman commented 3 years ago

@Shreeshrii I was trying to get on with the plotting, but I stumbled upon the term "validationlist". What does this mean in this context? AFAIU, the plotting requires an already existing *.traineddata model, which is passed to the lstmeval tool. Do I need, besides my previous training data, an additional list of lstmf files to measure the performance? For now, I do it the hard way, which means putting the new model into tessconfigs and doing a fresh run with real image data.

Shreeshrii commented 3 years ago

Usually for training we use a training list and eval list. You can use a different dataset for validation, if you want.
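As a rough illustration of a separate validation set, the following Python sketch splits a list of lstmf paths three ways. The function name, the ratios and the fixed seed are hypothetical and not part of tesstrain, which (as noted later in this thread) only creates list.train and list.eval.

```python
import random


def split_list(paths, eval_ratio=0.1, validate_ratio=0.1, seed=0):
    """Split lstmf paths into disjoint train, eval and validate lists.

    Hypothetical helper: ratios and seed are illustrative defaults only.
    """
    # Sort first so the result depends only on the seed, not on input order.
    shuffled = sorted(paths)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_ratio)
    n_validate = int(len(shuffled) * validate_ratio)
    eval_set = shuffled[:n_eval]
    validate_set = shuffled[n_eval:n_eval + n_validate]
    train_set = shuffled[n_eval + n_validate:]
    return train_set, eval_set, validate_set
```

Each returned list could then be written one path per line to list.train, list.eval and list.validate.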

tfukumori commented 3 years ago

@Shreeshrii Thanks for the great feature! I was able to draw a plot.

But I could not get the result of "Make TSV with Eval CER". It seems that Eval Char is not included in data/ocrd.log, so it was not output.

I think there is a problem with my environment or steps, but I would appreciate your advice if you can help me.

Environment and Steps.

Environment

$ tesseract --version
tesseract 5.0.0-alpha-20210401
 leptonica-1.80.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1

Steps

I think "validate" should be prepared separately from "eval", but for now, I have set the same thing as "eval" in "validate".

wget https://github.com/tesseract-ocr/tessdata_best/raw/master/frk.traineddata -O ~/tessdata_best/frk.traineddata

unzip ocrd-testset.zip -d data/ocrd-ground-truth

nohup make training MODEL_NAME=ocrd \
    START_MODEL=frk \
    TESSDATA=~/tessdata_best \
    MAX_ITERATIONS=10000 > data/ocrd.log

cp data/ocrd/list.eval data/ocrd/list.validate

bash -x plot.sh ocrd validate 10

Result

The logs did not contain any logs about "eval", unlike the logs described in the Issue.

Log

ocrd.log.zip

TSV

ocrd-validate-cer.tsv.zip

Plot

[image: ocrd-validate-cer]

stweil commented 2 years ago

Can we merge this pull request, or would anybody suggest additional changes?

Shreeshrii commented 2 years ago

@bertsky

Thank you for your detailed feedback. I agree that this may not be the best way to implement the plotting facility for tesseract training. The scripts are what I used at that point in time, and they helped me visualise the training process and results.

This PR can be closed if you or others provide an alternative implementation.

On Mon, Sep 6, 2021, 15:16 Robert Sachunsky @.***> wrote:

@bertsky requested changes on this pull request.

While I do very much welcome your effort to integrate a decent plotting facility, I have strong reservations about this implementation and about the way this PR is set up technically.

Beyond the comments given inline, I see two general questions:

  1. How do the two old Python scripts relate to the single new one? (There seems to be lots of code re-use. It's hard to compare to the previous version this way.)
  2. How does this relate to the bigger problem https://github.com/tesseract-ocr/tesstrain/issues/261 of CER not being calculated correctly by lstmtraining/lstmeval and checkpoints only being created at arbitrary intervals?

In plot/Makefile https://github.com/tesseract-ocr/tesstrain/pull/236#discussion_r702270009 :

@@ -0,0 +1,132 @@

+# Name of the model to be evaluated. Default: $(MODEL_NAME)

+MODEL_NAME = foo

+

+# Suffix of list of lstmf files to be used as validation set e.g. list.validate. Default: $(VALIDATE_LIST)

+VALIDATE_LIST=validate

+

+# Integer part of maximum Validation CER, ONLY use values between 0-9. Default: $(VALIDATE_CER)

+VALIDATE_CER=9

+

+# Training log file. This should match logfile name from training. Default: $(MODEL_LOG)

+MODEL_LOG =../data/$(MODEL_NAME).log

Seeing ../data here, this assumes that the toplevel training used the default DATA_DIR=data, which is overly restrictive. I suggest either rewriting this in terms of a DATA_DIR variable as well, or recursively calling the plot/Makefile with everything exported from the toplevel Makefile.

In plot/Makefile https://github.com/tesseract-ocr/tesstrain/pull/236#discussion_r702270524 :

  • @find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.$(MODELNAME)[0-9].[0-9]_..*.checkpoint -exec rename -v 's/(.[0-9])/$${1}00_/' {} \;
  • @find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.$(MODELNAME)[0-9].[0-9][0-9].*.*.checkpoint -exec rename -v 's/(.[0-9][0-9])/$${1}0/' {} \;

What is rename – it's certainly not POSIX, and does not work on Linux.

In plot/Makefile https://github.com/tesseract-ocr/tesstrain/pull/236#discussion_r702270896 :

  • @grep 'At iteration' $(MODEL_LOG) \
  • | sed -e '/^Sub/d' \

  • | sed -e '/^Update/d' \

  • | sed -e '/^ New worst char/d' \

  • | sed -e 's/At iteration ([0-9])\/([0-9])\/.*char train=/\t\t\1\t\2\t\t/' \

  • | sed -e 's/%, word.*/\t/' >> "$@"

I would strongly recommend generating a CSV/TSV file directly from lstmtraining.cpp instead of trying to parse its output strings here – any slight change there (different wording/formatting or additional lines) would break this.
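Until such a data file is written by lstmtraining itself, a slightly less brittle stopgap than a chain of sed calls could be a single anchored pattern. The Python sketch below is illustrative only: the line shape is assumed from the sed expressions quoted above (the first two numbers being the learning and training iterations), and the function and column names are hypothetical.

```python
import csv
import io
import re

# Assumed line shape, inferred from the sed expressions quoted above:
# "At iteration 100/200/200, Mean rms=..., char train=12.5%, word train=..."
ITERATION_RE = re.compile(
    r"^At iteration (\d+)/(\d+)/\d+,.*char train=([0-9.]+)%"
)


def training_log_to_tsv(log_text):
    """Return TSV text with one row per matching 'At iteration' line."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["LearningIteration", "TrainingIteration", "IterationCER"])
    for line in log_text.splitlines():
        match = ITERATION_RE.match(line)
        if match:
            # Lines that do not match are skipped rather than half-parsed.
            writer.writerow(match.groups())
    return out.getvalue()
```

This still breaks if the wording of the log changes, which is exactly the concern raised above; it only keeps the assumption in one place.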

In plot/Makefile https://github.com/tesseract-ocr/tesstrain/pull/236#discussion_r702273105 :

+help:

  • @echo ""

  • @echo " Targets"

  • @echo ""

  • @echo " traineddata Create best and fast .traineddata files from each .checkpoint file"

  • @echo " plotvalidatecer Make plots from TSV files generated from training and eval logs"

  • @echo ""

  • @echo " Variables"

  • @echo ""

  • @echo " MODEL_NAME Name of the model to be built. Default: $(MODEL_NAME)"

  • @echo " VALIDATE_LIST Suffix of lstmf files list, use validate for list.validate. Default: $(VALIDATE_LIST)"

  • @echo ""

+# END-EVAL

+

+.PHONY: $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) $(TSV_VALIDATE_CER) $(TMP_VALIDATE_LOG) $(TMP_FAST_LOG)

These are concrete file targets; marking them as phony is wrong.

In plot/Makefile https://github.com/tesseract-ocr/tesstrain/pull/236#discussion_r702714781 :

  • echo "Name CheckpointCER LearningIteration TrainingIteration EvalCER IterationCER ValidationCER" > "$@" ;\
  • else \

  • grep -E "$(VALIDATE_LIST).log$$|iteration" $(TMP_FAST_LOG) > $(TMP_VALIDATE_LOG) ;\

  • echo "Name CheckpointCER LearningIteration TrainingIteration EvalCER IterationCER ValidationCER" > "$@" ;\

  • sed 'N;s/\nAt iteration 0, stage 0, /At iteration 0, stage 0, /;P;D' $(TMP_VALIDATE_LOG) \

  • | grep 'Eval Char' \

  • | sed -e "s/.$(VALIDATE_LIST).log.*Eval Char error rate=/\t\t\t/" \

  • | sed -e 's/, Word.*$$//' \

  • | sed -e 's/(^.)_([0-9].)([0-9].*)([0-9].*)\t/\1\t\2\t\3\t\4\t/g' >> "$@" ;\

  • fi;

+# Build fast traineddata file list with CER in range [0-VALIDATE_CER].[0-9].

+FAST_DATA_FILES := $(shell find ../data/$(MODEL_NAME)/tessdata_fast/ -type f -name $(MODELNAME)[0-$(VALIDATECER)].[0-9]**.traineddata | sort -n -r)

+

+# Build validate log files list based on above traineddata list.

+FAST_LOG_FILES := $(subst tessdata_fast,tessdata_fast,$(patsubst %.traineddata,%.$(VALIDATE_LIST).log,$(FAST_DATA_FILES)))

The outer subst is a no-op. Suggested change:

-FAST_LOG_FILES := $(subst tessdata_fast,tessdata_fast,$(patsubst %.traineddata,%.$(VALIDATE_LIST).log,$(FAST_DATA_FILES)))

+FAST_LOG_FILES := $(FAST_DATA_FILES:%.traineddata=%.$(VALIDATE_LIST).log)


In plot/Makefile https://github.com/tesseract-ocr/tesstrain/pull/236#discussion_r702723214 :

  • @echo " VALIDATE_LIST Suffix of lstmf files list, use validate for list.validate. Default: $(VALIDATE_LIST)"
  • @echo ""

+# END-EVAL

+

+.PHONY: $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) $(TSV_VALIDATE_CER) $(TMP_VALIDATE_LOG) $(TMP_FAST_LOG)

+

+# Rename checkpoints with one/two decimal digits to 3 decimal digts for correct sorting later.

+# Run Makefile in main directory to create traineddata from all checkpoints.

+# Add ../ to lstmf file names in validate list relative to plot subdirectory.

+traineddata:

  • @find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.$(MODELNAME)[0-9].[0-9]_..*.checkpoint -exec rename -v 's/(.[0-9])/$${1}00_/' {} \;

  • @find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.$(MODELNAME)[0-9].[0-9][0-9].*.*.checkpoint -exec rename -v 's/(.[0-9][0-9])/$${1}0/' {} \;

  • $(MAKE) -C ../ traineddata MODEL_NAME=$(MODEL_NAME)

  • @mkdir -p $(PLOT_DIR)

  • @cp ../data/$(MODEL_NAME)/list.${VALIDATE_LIST} $(TMP_VALIDATE_LIST)

This line creates TMP_VALIDATE_LIST, the rule therefore should mark this file as its target (perhaps as a dependent sub-rule).

But I don't see how the default list.validate should get created in the first place. So far, tesstrain only creates list.train and list.eval. (And since it does not make any actual use of the eval files, not for checkpointing and not even for checkpoint selection, I don't see the merit in providing a second hold-out set. If you have a manual split, you could easily pass that into list.train / list.eval already.)

In plot/Makefile https://github.com/tesseract-ocr/tesstrain/pull/236#discussion_r702724173 :

  • @echo ""

+

+# END-EVAL

+

+.PHONY: $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) $(TSV_VALIDATE_CER) $(TMP_VALIDATE_LOG) $(TMP_FAST_LOG)

+

+# Rename checkpoints with one/two decimal digits to 3 decimal digts for correct sorting later.

+# Run Makefile in main directory to create traineddata from all checkpoints.

+# Add ../ to lstmf file names in validate list relative to plot subdirectory.

+traineddata:

  • @find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.$(MODELNAME)[0-9].[0-9]_..*.checkpoint -exec rename -v 's/(.[0-9])/$${1}00_/' {} \;

  • @find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.$(MODELNAME)[0-9].[0-9][0-9].*.*.checkpoint -exec rename -v 's/(.[0-9][0-9])/$${1}0/' {} \;

  • $(MAKE) -C ../ traineddata MODEL_NAME=$(MODEL_NAME)

  • @mkdir -p $(PLOT_DIR)

  • @cp ../data/$(MODEL_NAME)/list.${VALIDATE_LIST} $(TMP_VALIDATE_LIST)

  • @sed -i -e 's/^data/..\/data/' $(TMP_VALIDATE_LIST)

Again, this makes overly strong directory assumptions. Even if you pass in DATA_DIR from the main makefile, you need to calculate the exact relative path, or use absolute paths entirely.
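One way to avoid the relative-path rewrite is to resolve every list entry to an absolute path once. A minimal Python sketch follows, assuming the list file holds one lstmf path per line relative to the directory where training was run; the function name and the base_dir parameter are hypothetical.

```python
import os


def absolutize_list(lines, base_dir):
    """Resolve list entries against base_dir and return absolute paths.

    Hypothetical helper; entries that are already absolute are kept as they are.
    """
    base = os.path.abspath(base_dir)
    result = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines in the list file
        # os.path.join discards `base` when `entry` is already absolute.
        result.append(os.path.normpath(os.path.join(base, entry)))
    return result
```

The resulting list can be passed to lstmeval from any working directory, so no sed rewrite of the list is needed.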

In plot.sh https://github.com/tesseract-ocr/tesstrain/pull/236#discussion_r702741642 :

+cd plot

+make MODEL_NAME=$1 VALIDATE_LIST=$2 Y_MAX_CER=$3

+make MODEL_NAME=$1 VALIDATE_LIST=$2 Y_MAX_CER=$3

If you need a shell script to run a makefile, then that makefile is poorly written (or documented).

Assuming it is a good choice to have a separate makefile for plotting in the first place, I think the plot/Makefile should just require/assume that the toplevel Makefile's traineddata target has already been run.

At the very least, this should read make traineddata && make plotvalidatecer, not make all; make all.

In plot/Makefile https://github.com/tesseract-ocr/tesstrain/pull/236#discussion_r702744663 :

  • fi;

+

+# Build fast traineddata file list with CER in range [0-VALIDATE_CER].[0-9].

+FAST_DATA_FILES := $(shell find ../data/$(MODEL_NAME)/tessdata_fast/ -type f -name $(MODELNAME)[0-$(VALIDATECER)].[0-9]**.traineddata | sort -n -r)

+

+# Build validate log files list based on above traineddata list.

+FAST_LOG_FILES := $(subst tessdata_fast,tessdata_fast,$(patsubst %.traineddata,%.$(VALIDATE_LIST).log,$(FAST_DATA_FILES)))

+

+#Note: This does not find the new traineddata files from current run.

+# Hence make needs to be run twice to generate new validate.log files.

+

+$(FAST_LOG_FILES): %.$(VALIDATE_LIST).log: %.traineddata

  • OMP_THREAD_LIMIT=1 time -p lstmeval \

  • --verbosity=0 \

  • --model $< \

  • --eval_listfile $(TMP_VALIDATE_LIST) 2>&1 | grep "^At iteration" > $@

This rule still needs to declare all its dependencies, especially TMP_VALIDATE_LIST and FAST_DATA_FILES.

In plot/Makefile https://github.com/tesseract-ocr/tesstrain/pull/236#discussion_r702745733 :

+# Build validate log files list based on above traineddata list.

+FAST_LOG_FILES := $(subst tessdata_fast,tessdata_fast,$(patsubst %.traineddata,%.$(VALIDATE_LIST).log,$(FAST_DATA_FILES)))

+

+#Note: This does not find the new traineddata files from current run.

+# Hence make needs to be run twice to generate new validate.log files.

+

+$(FAST_LOG_FILES): %.$(VALIDATE_LIST).log: %.traineddata

  • OMP_THREAD_LIMIT=1 time -p lstmeval \

  • --verbosity=0 \

  • --model $< \

  • --eval_listfile $(TMP_VALIDATE_LIST) 2>&1 | grep "^At iteration" > $@

+# Concatenate all validate.log files along with their filenames so as to include iteration number for TSV.

+$(TMP_FAST_LOG): $(FAST_LOG_FILES)

  • @for i in $^; do \

  • echo Filename : "$$i";echo;cat "$$i"; \

What are the Filename lines used for?

In plot/plot-eval-validate-cer.py https://github.com/tesseract-ocr/tesstrain/pull/236#discussion_r702749893 :

+ytsvfile = "tmp-" + args.model + "-" + args.validatelist + "-iteration.tsv"

+ctsvfile = "tmp-" + args.model + "-" + args.validatelist + "-checkpoint.tsv"

+etsvfile = "tmp-" + args.model + "-" + args.validatelist + "-eval.tsv"

+vtsvfile = "tmp-" + args.model + "-" + args.validatelist + "-validate.tsv"

+plotfile = "../data/" + args.model + "/plot/" + args.model + "-" + args.validatelist + "-cer.png"

Hard-coding the values in Python redundantly instead of passing the fully defined path names from the makefile is really bad style.
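A sketch of the alternative, in which the makefile passes every path explicitly and the script derives nothing on its own. The option names below are hypothetical and are not those of the script in this PR; the default for the y-axis limit is likewise only an example.

```python
import argparse


def build_parser():
    """Build a parser that takes all file paths from the caller (sketch)."""
    parser = argparse.ArgumentParser(
        description="Plot CER curves from TSV files (illustrative sketch)."
    )
    # Hypothetical option names: one explicit path per input file.
    parser.add_argument("--iteration-tsv", required=True)
    parser.add_argument("--checkpoint-tsv", required=True)
    parser.add_argument("--eval-tsv", required=True)
    parser.add_argument("--validate-tsv", required=True)
    # Output path is also passed in, so it can be the makefile rule's target.
    parser.add_argument("--output", required=True)
    # A default here means the caller does not have to define the y-axis limit.
    parser.add_argument("--y-max-cer", type=float, default=10.0)
    return parser
```

A makefile rule could then call the script with its prerequisites as the input options and its target as --output, keeping all file names defined in one place.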

In plot/Makefile https://github.com/tesseract-ocr/tesstrain/pull/236#discussion_r702751174 :

+$(FAST_LOG_FILES): %.$(VALIDATE_LIST).log: %.traineddata

  • OMP_THREAD_LIMIT=1 time -p lstmeval \

  • --verbosity=0 \

  • --model $< \

  • --eval_listfile $(TMP_VALIDATE_LIST) 2>&1 | grep "^At iteration" > $@

+# Concatenate all validate.log files along with their filenames so as to include iteration number for TSV.

+$(TMP_FAST_LOG): $(FAST_LOG_FILES)

  • @for i in $^; do \

  • echo Filename : "$$i";echo;cat "$$i"; \

  • done > $@

+# Combine TSV files with all required CER values, generated from training log and validation logs. Plot.

+$(TSV_VALIDATE_CER): $(TMP_FAST_LOG) $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE)

  • @cat $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) > "$@"

  • python plot-eval-validate-cer.py -m $(MODEL_NAME) -v $(VALIDATE_LIST) -y $(Y_MAX_CER)

This should really be a separate rule with its own target (the concrete plot PNG), passing all the path names of the necessary input files.

Relying on Y_MAX_CER to be defined outside the makefile (instead of, say, providing a default) is bad style.

In plot/Makefile https://github.com/tesseract-ocr/tesstrain/pull/236#discussion_r702752622 :

  • OMP_THREAD_LIMIT=1 time -p lstmeval \
  • --verbosity=0 \

  • --model $< \

  • --eval_listfile $(TMP_VALIDATE_LIST) 2>&1 | grep "^At iteration" > $@

+# Concatenate all validate.log files along with their filenames so as to include iteration number for TSV.

+$(TMP_FAST_LOG): $(FAST_LOG_FILES)

  • @for i in $^; do \

  • echo Filename : "$$i";echo;cat "$$i"; \

  • done > $@

+# Combine TSV files with all required CER values, generated from training log and validation logs. Plot.

+$(TSV_VALIDATE_CER): $(TMP_FAST_LOG) $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE)

  • @cat $(TMP_100_ITERATIONS) $(TMP_CHECKPOINT) $(TMP_EVAL) $(TMP_VALIDATE) > "$@"

  • python plot-eval-validate-cer.py -m $(MODEL_NAME) -v $(VALIDATE_LIST) -y $(Y_MAX_CER)

  • @rm tmp-$(MODEL_NAME)-$(VALIDATE_LIST).

This should be placed into a separate (phony) rule (like clean), and instead of redefining the filenames implicitly (tmp-*), it should re-use the actual definitions above.

In plot/Makefile https://github.com/tesseract-ocr/tesstrain/pull/236#discussion_r702270711 :

  • @find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.$(MODELNAME)[0-9].[0-9]_..*.checkpoint -exec rename -v 's/(.[0-9])/$${1}00_/' {} \;
  • @find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.$(MODELNAME)[0-9].[0-9][0-9].*.*.checkpoint -exec rename -v 's/(.[0-9][0-9])/$${1}0/' {} \;

At any rate, I would prefer changing lstmtraining.cpp to simply create zero-padded file names in the first place.

In plot/Makefile https://github.com/tesseract-ocr/tesstrain/pull/236#discussion_r702718779 :

  • @find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.$(MODELNAME)[0-9].[0-9]_..*.checkpoint -exec rename -v 's/(.[0-9])/$${1}00_/' {} \;
  • @find ../data/$(MODEL_NAME)/checkpoints/ -regex ^.$(MODELNAME)[0-9].[0-9][0-9].*.*.checkpoint -exec rename -v 's/(.[0-9][0-9])/$${1}0/' {} \;

Then again, I don't even see the need for zero padding filenames at all: you can always sort them correctly via sort -n.
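The same idea in Python, for a script that has to order checkpoints without renaming them: sort on the parsed numbers rather than on the raw strings. The file name pattern (model name, then CER, learning iteration and training iteration, separated by underscores) is an assumption based on the rename expressions quoted above, and the helper names are hypothetical.

```python
import re

# Assumed name shape: <model>_<CER>_<learning iter>_<training iter>.checkpoint
CHECKPOINT_RE = re.compile(r"_(\d+(?:\.\d+)?)_(\d+)_(\d+)\.checkpoint$")


def checkpoint_key(name):
    """Return (training iteration, CER) as numbers for sorting."""
    match = CHECKPOINT_RE.search(name)
    if match is None:
        raise ValueError("unexpected checkpoint name: %s" % name)
    cer, _learning, training = match.groups()
    return (int(training), float(cer))


def sort_checkpoints(names):
    """Sort checkpoint names numerically, without zero padding."""
    return sorted(names, key=checkpoint_key)
```

Sorting on other fields (for example CER first) only requires changing the tuple returned by checkpoint_key.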


bertsky commented 2 years ago

Thanks for your fast feedback @Shreeshrii. In that case I suggest we do close, and (being very much interested in this feature) I promise I'll revisit as soon as #261 is out of the way. (I will strive for a close integration with lstmtraining to output data files, and with the tesstrain makefile, and then re-use your plotting code.)

Shreeshrii commented 2 years ago

@bertsky I am closing this. Hope you will add a better implementation soon.

whisere commented 2 years ago

I have the same issue as https://github.com/tesseract-ocr/tesstrain/pull/236#issuecomment-858253025, which is also reported in https://github.com/Shreeshrii/tesstrain-ben/issues/1

Shreeshrii commented 2 years ago

@bertsky is planning to improve the output CER as discussed in above thread and will redo the plotting feature.

I have added some more hacks to the scripts for my own personal use based on the above feedback. I will look into posting them in a repo and will post a link here as a workaround until the official update.

whisere commented 2 years ago

That sounds great! Thank you!

Shreeshrii commented 2 years ago

See https://github.com/Shreeshrii/tess5train-fonts

https://github.com/Shreeshrii/tess5train-fonts/blob/main/data/engFineTuned/plots/engFineTuned-LOG-2.png

https://github.com/Shreeshrii/tess5train-fonts/blob/main/data/engLayer/plots/engLayer-LOG-2.png

and also

https://github.com/Shreeshrii/tess5train-fonts/blob/main/data/engLayer/plots/engLayer-2.png

https://github.com/Shreeshrii/tess5train-fonts/blob/main/data/engFineTuned/plots/engFineTuned-2.png

whisere commented 2 years ago

Many thanks! I will try that!

whisere commented 2 years ago

Just to clarify: am I right that it still doesn't plot eval data for fine-tuned training (only for replace-a-layer training)?

Shreeshrii commented 2 years ago

The scripts create a TSV from the log file generated during the training process. If tesseract does not run the evaluation during training, then the info won't be in the log file and won't get plotted. Also, the eval info does not include the training iteration number; it only has the learning iteration number.

As an alternative, I run lstmeval on each of the checkpoints after training and plot that separately. I have also added accuracy info from impact centre's ocreval as well as from the ISRI evaluation. Plotting of accuracy is not yet implemented.

whisere commented 2 years ago

Many thanks for the excellent work! I have tried, but the script doesn't work with the ground truth tif and txt pairs we have for fine tuning... it seems to require a model.training_text file and reports:

[01:32:54] INFO - === Starting training for language eng
[01:32:54] INFO - Testing font: Arial Bold
[01:32:54] ERROR - Could not find font named Arial Bold. Pango suggested font DejaVu Sans Bold. Please correct --font arg.
[01:32:54] INFO - Program /usr/bin/text2image failed with return code 1. Abort.
[01:32:54] INFO - === Phase I: Generating training images ===
[01:32:54] CRITICAL - Required/expected file 'data/ground-truth/modelname-eval.training_text' does not exist
Makefile:437: recipe for target 'data/gtd/list.eval' failed
make: *** [data/gtd/list.eval] Error 1

with nohup bash 2-training.sh eng Latin eng modelname FineTune 9999 > data/logs/modelname.LOG &

I will keep investigating.

Shreeshrii commented 2 years ago

Yes, as my repo name indicates this is the version for training from fonts and training text. However the plotting part only depends on the log file from training.

I will upload a different version that works with existing tesstrain makefile.

whisere commented 2 years ago

Amazing! I will try when it is available : )

And the lstmeval commands do work and give useful CER and WER output for the models from the original make run, although the results do not show up in the plot:

nohup lstmeval --model data/modelname.traineddata --eval_listfile data/modelname/list.eval --verbosity 2 > data/modelname-eval-list.log &

nohup lstmeval --model data/eng.traineddata --eval_listfile data/modelname/list.eval --verbosity 2 > data/eng-eval-list.log &
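To compare two such logs without plotting, the summary numbers can be pulled out with a small script. This Python sketch assumes the summary line has the shape implied by the grep and sed expressions quoted earlier in this thread ("Eval Char error rate=..., Word error rate=..."); the function name is hypothetical.

```python
import re

# Assumed summary shape, inferred from the expressions quoted in the review:
# "At iteration 0, stage 0, Eval Char error rate=1.25, Word error rate=4.5"
EVAL_RE = re.compile(
    r"Eval Char error rate=(\d+(?:\.\d+)?), Word error rate=(\d+(?:\.\d+)?)"
)


def parse_lstmeval_log(text):
    """Return (CER, WER) from lstmeval output, or None if no summary is found."""
    match = EVAL_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))
```

Reading data/modelname-eval-list.log and data/eng-eval-list.log and passing their contents to this function would give one (CER, WER) pair per model.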