tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Running Tesseract 5 training and how I solved the issues I found #341

Open mvfpoa opened 1 year ago

mvfpoa commented 1 year ago

Hi there.

Just want to share how I managed to run Tesseract training with tesstrain on version 5. It might help others, and I hope it can be used to improve the documentation.

This was my first try at Tesseract training; I had never done it before.

I cloned tesseract from git at tag 5.3 and was able to build it exactly as documented here: https://github.com/tesseract-ocr/tessdoc/blob/main/Compiling-–-GitInstallation.md

I performed the installation on Ubuntu running on WSL.

I cloned the latest tesstrain from git and followed this page: https://github.com/tesseract-ocr/tesstrain

That document recommends (https://github.com/tesseract-ocr/tesstrain#provide-ground-truth) trying the training with the ocrd-testset.zip files. I unzipped the contents into a folder named 'data/foo-ground-truth/'. I had created the 'data' folder myself to hold the files when running make tesseract-langdata, as stated in the document.
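For reference, a minimal sketch of that step, assuming ocrd-testset.zip was already downloaded into the tesstrain checkout:

# unpack the sample ground truth into the location tesstrain expects
mkdir -p data/foo-ground-truth
unzip ocrd-testset.zip -d data/foo-ground-truth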

So I ran make training, and the result was a lot of error messages:

Can't encode transcription: '<some random german phrase>' in language ''
Encoding of string failed! Failure bytes: <some hex codes>

Side note: I needed to run it twice; it looks like the first run crashes while building the all-gt file.

It was clearly something related to the unicharset, which did not describe the special characters that exist in the sample ground truth.

After studying for a while, I decided on my own to replace the unicharset file in data/foo/ with the contents of data/langdata/Latin.unicharset:

cp data/langdata/Latin.unicharset data/foo/unicharset

That completely solved the error messages, and training finally started.

After some minutes, the training BCER, which started at 89%, climbed to 99.9%. Something was clearly wrong again.

I dug around the web and had a hunch that the issue was that I hadn't specified a starter traineddata, so the training was running from scratch.

I then specified START_MODEL and the result was much better: the BCER started below 20% and continued to improve.
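Concretely, this just means adding START_MODEL=eng (and pointing TESSDATA at the local tessdata install) to the make call, the same way it appears in the step list further down:

make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng training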

When specifying the starter model, the training process extracts the unicharset from that model and puts it in the data/eng folder. I was expecting eng.traineddata to use Latin.unicharset, but that seems not to be the case (perhaps deu.traineddata does?), so copying the unicharset is still necessary. For my application I will be using eng.traineddata, so I decided to continue with the English traineddata instead of the German one (which I haven't tried).
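For anyone who wants to verify which unicharset a traineddata file actually carries, combine_tessdata can list and unpack its components; this is a sketch assuming the models live in /usr/local/share/tessdata/ as on my setup:

# list the components packed into the starter model
combine_tessdata -d /usr/local/share/tessdata/eng.traineddata
# unpack everything (including eng.lstm-unicharset) for inspection
combine_tessdata -u /usr/local/share/tessdata/eng.traineddata data/eng/eng.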

To have a cleaner run, I decided to run the training in steps. Those were:

# let's start by cleaning the environment
make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng clean
# first run is expected to error while creating the foo/all-gt file
make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng unicharset
# second run completes cleanly
make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng unicharset
# replace the extracted unicharset with the Latin one
cp data/langdata/Latin.unicharset data/eng/foo.lstm-unicharset
make -r TESSDATA=/usr/local/share/tessdata/ MODEL_NAME=foo START_MODEL=eng training
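Once training finishes, tesstrain writes the final model to data/foo.traineddata. A quick usage sketch (the image name here is just a hypothetical example):

tesseract sample.png sample-out --tessdata-dir data -l foo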

I hope this can support the Tesseract community, and any contribution is welcome.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.