tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Makefile and python scripts for Validation CER plotting #207

Status: Closed (Shreeshrii closed 3 years ago)

Shreeshrii commented 3 years ago

See https://github.com/Shreeshrii/tesstrain-sanPlusMinus for the repo from which the sample data and plots are taken.

Currently, plotting uses a Makefile, bash, and Python.

Shreeshrii commented 3 years ago

I want to change this to not use bash scripts.

Edit: Done

Added a Makefile in the plot/ subdirectory.

Shreeshrii commented 3 years ago

There is still one problem with the Makefile.

The workflow should be:

Currently, traineddata files are built from any checkpoints created since the last run. I would like lstmeval to run on these new traineddata files too. However, it runs only on traineddata files that already existed before the run. See the following example.

make -C ../ traineddata MODEL_NAME=ocrd
make[1]: Entering directory '/home/ubuntu/tesstrain'
lstmtraining \
          --stop_training \
          --continue_from data/ocrd/checkpoints/ocrd_1.005_2678_6700.checkpoint \
          --traineddata data/ocrd/ocrd.traineddata \
          --model_output data/ocrd/tessdata_best/ocrd_1.005_2678_6700.traineddata
Loaded file data/ocrd/checkpoints/ocrd_1.005_2678_6700.checkpoint, unpacking...
lstmtraining \
          --stop_training \
          --continue_from data/ocrd/checkpoints/ocrd_1.005_2678_6700.checkpoint \
          --traineddata data/ocrd/ocrd.traineddata \
          --convert_to_int \
          --model_output data/ocrd/tessdata_fast/ocrd_1.005_2678_6700.traineddata
Loaded file data/ocrd/checkpoints/ocrd_1.005_2678_6700.checkpoint, unpacking...
make[1]: Leaving directory '/home/ubuntu/tesstrain'
sed -i -e 's/^data/..\/data/' ../data/ocrd/list.validate
OMP_THREAD_LIMIT=1 lstmeval  \
    --verbosity=0 \
    --model ../data/ocrd/tessdata_fast/ocrd_1.057_2602_6500.traineddata \
    --eval_listfile ../data/ocrd/list.validate    > ../data/ocrd/tessdata_fast/ocrd_1.057_2602_6500.validate.log 2>&1
OMP_THREAD_LIMIT=1 lstmeval  \
    --verbosity=0 \
    --model ../data/ocrd/tessdata_fast/ocrd_1.094_2575_6400.traineddata \
    --eval_listfile ../data/ocrd/list.validate    > ../data/ocrd/tessdata_fast/ocrd_1.094_2575_6400.validate.log 2>&1
OMP_THREAD_LIMIT=1 lstmeval  \
    --verbosity=0 \
    --model ../data/ocrd/tessdata_fast/ocrd_1.152_2535_6300.traineddata \
    --eval_listfile ../data/ocrd/list.validate    > ../data/ocrd/tessdata_fast/ocrd_1.152_2535_6300.validate.log 2>&1

So, it requires two runs to get all the validate logs plotted.

Any suggestions on how to fix it?

kba commented 3 years ago

IIUC (haven't tested this yet): You want log files based on traineddata files that are created with make -C ../ traineddata MODEL_NAME=ocrd to be included in $(FAST_VALIDATE_LOG_FILES)? The problem AFAICS is that you assign

FAST_DATA_FILES = $(sort $(wildcard ../data/$(MODEL_NAME)/tessdata_fast/*_[0-$(VALIDATE_CER)].[0-9]*.traineddata))
FAST_VALIDATE_LOG_FILES = $(subst tessdata_fast,tessdata_fast,$(patsubst %.traineddata,%.$(VALIDATE_LIST).log,$(FAST_DATA_FILES)))

at the beginning, so the list does not reflect files created later in the same run. I have no ready solution, but perhaps you could force re-evaluation of the expanded variable by delegating to $(shell find) instead of $(wildcard)?

Shreeshrii commented 3 years ago

@kba Using $(shell find) instead of $(wildcard) gives the same result. I still need to run make twice.

I have added comments to the makefile and also uploaded the plots from ocrd run.

@stweil Any suggestions on how to avoid having to run make twice?
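One possible way around the double run, offered as an untested sketch rather than the fix adopted in the repository: GNU make expands prerequisite lists while it reads the Makefile, before any recipe runs, so both $(wildcard) and $(shell find) see only the files that exist at startup. Starting a second make process after the traineddata step forces the file list to be computed again. The target names `all`, `traineddata`, and `validate` below are hypothetical; `FAST_VALIDATE_LOG_FILES` is assumed to be defined as quoted above, and the Makefile is assumed to be the default one in the plot directory.

```make
# Sketch only: target names are hypothetical, not taken from the repository.
# Recipe lines must start with a tab character.
.PHONY: all traineddata validate

# Phase 1 builds traineddata; the recipe then starts a fresh make process,
# which re-reads this Makefile and re-evaluates FAST_DATA_FILES.
all: traineddata
	$(MAKE) validate MODEL_NAME=$(MODEL_NAME)

# Same command as shown in the log above.
traineddata:
	$(MAKE) -C ../ traineddata MODEL_NAME=$(MODEL_NAME)

# Phase 2: by the time this runs in the sub-make, the new traineddata
# files exist, so their validate logs are part of the prerequisite list.
validate: $(FAST_VALIDATE_LOG_FILES)
```

With this layout, a single `make all MODEL_NAME=ocrd` would run both phases. The cost is one extra make invocation, and the plotting targets would need to depend on the second phase in the same way.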