tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Running Tesseract 5 training and how I solved the issues I found #341

Open mvfpoa opened 1 year ago

mvfpoa commented 1 year ago

Hi there.

Just want to share how I managed to run Tesseract training with tesstrain on version 5. It might help others, and I hope it can be used to improve the documentation.

This was my first try at Tesseract training; I had never done it before.

I cloned tesseract from git at tag 5.3 and was able to build it exactly as documented here: https://github.com/tesseract-ocr/tessdoc/blob/main/Compiling-–-GitInstallation.md

I performed the installation on Ubuntu running on WSL.

I cloned the latest tesstrain from git and followed this page: https://github.com/tesseract-ocr/tesstrain

That document recommends (https://github.com/tesseract-ocr/tesstrain#provide-ground-truth) trying the training with the ocrd-testset.zip files. I unzipped the contents into a folder named 'data/foo-ground-truth/'. I had created the 'data' folder myself to hold the files when running make tesseract-langdata, as stated in the document.
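For reference, a minimal sketch of that step, assuming ocrd-testset.zip was already downloaded into the tesstrain checkout:

# unpack the sample ground truth into the location tesstrain expects
mkdir -p data/foo-ground-truth
unzip ocrd-testset.zip -d data/foo-ground-truth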

So I ran make training, and the result was a lot of error messages:

Can't encode transcription: '<some random german phrase>' in language ''
Encoding of string failed! Failure bytes: <some hex codes>

Side note: I needed to run it twice; it looks like the first run crashes while building the all-gt file.

It was clearly something related to the unicharset, which did not describe the special characters that exist in the sample ground truth.

After studying for a while, I decided on my own to replace the unicharset file in data/foo/ with the contents of data/langdata/Latin.unicharset:

cp data/langdata/Latin.unicharset data/foo/unicharset

That completely solved the error messages, and training finally started.

After some minutes, the training BCER, which started at 89%, climbed to 99.9%. Something was clearly wrong again.

I dug around the web and had a hunch that the issue was that I hadn't specified a starter traineddata, so the training was running from scratch.

I then specified START_MODEL and the result was much better: the BCER started below 20% and continued to improve.
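Concretely, this just means adding START_MODEL=eng (and pointing TESSDATA at the local tessdata install) to the make call, the same way it appears in the step list further down:

make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng training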

When specifying the starter model, the training process extracts the unicharset from that model and puts it in the data/eng folder. I was expecting eng.traineddata to use Latin.unicharset, but that seems not to be the case (perhaps deu.traineddata does?), so copying the unicharset is still necessary. For my application I will be using eng.traineddata, so I decided to continue with the English traineddata instead of the German one (which I haven't tried).
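For anyone who wants to verify which unicharset a traineddata file actually carries, combine_tessdata can list and unpack its components; this is a sketch assuming the models live in /usr/local/share/tessdata/ as on my setup:

# list the components packed into the starter model
combine_tessdata -d /usr/local/share/tessdata/eng.traineddata
# unpack everything (including eng.lstm-unicharset) for inspection
combine_tessdata -u /usr/local/share/tessdata/eng.traineddata data/eng/eng.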

To have a cleaner run, I decided to run the training in steps. Those were:

# let's start by cleaning the environment
make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng clean
# first run is expected to error while creating the foo/all-gt file
make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng unicharset
# second run completes cleanly
make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng unicharset
# replace the extracted unicharset with the Latin one
cp data/langdata/Latin.unicharset data/eng/foo.lstm-unicharset
make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng training
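Once training finishes, tesstrain writes the final model to data/foo.traineddata. A quick usage sketch (the image name here is just a hypothetical example):

tesseract sample.png sample-out --tessdata-dir data -l foo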

I hope this can support the Tesseract community, and any contribution is welcome.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.