[Question] High error rate after training - why?

jovargas commented 6 years ago

I'm training on a new set of two fonts. The goal is to use Tesseract to recognize individual characters (capital letters and digits only), not entire words, but the results are poor and I suspect I'm doing something wrong. Tesseract and Leptonica are installed by the scripts.

Inspired by the test set provided in this repo, I created these .tif files with their matching .gt.txt files:

From original binarized chars:

From two TTFs to TIF images with random text:

At the end of the data creation process I have 1869 mixed text lines.
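A quick sanity check on the image/transcription pairs can rule out data problems before training. The following is only a minimal sketch: it assumes each `.tif` sits next to a `.gt.txt` with the same base name, and that a valid transcription is a single line of capital letters, digits, and spaces. The function name and the directory path in the usage note are hypothetical.

```python
import re
from pathlib import Path

# Assumption: valid ground truth is one line of capital letters, digits, spaces.
ALLOWED = re.compile(r"[A-Z0-9 ]+")


def check_ground_truth(data_dir):
    """Return (images without a .gt.txt, .gt.txt files with unexpected content)."""
    data_dir = Path(data_dir)
    missing, bad = [], []
    for tif in sorted(data_dir.glob("*.tif")):
        gt = tif.with_name(tif.stem + ".gt.txt")
        if not gt.exists():
            missing.append(tif.name)
            continue
        text = gt.read_text(encoding="utf-8").strip()
        # fullmatch rejects empty, multi-line, and out-of-alphabet transcriptions
        if not ALLOWED.fullmatch(text):
            bad.append(gt.name)
    return missing, bad
```

Calling `check_ground_truth("data/ground-truth")` (a hypothetical path) returns two lists; both should be empty if every line image has a clean transcription.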

First I ran the Makefile with the default of 10000 iterations, but the best error rate was high. I assumed it just needed more iterations, so I raised the limit to 30000, but nothing improved. The following image shows the results:

Sometimes I see the char train error rate increasing instead of decreasing. What am I missing here? Do I need more training data? Does my initial data violate some important requirement? I'd appreciate any help!
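To see how often the char train error actually rises, the training output can be parsed into a curve. This is a rough sketch, not part of tesstrain: it assumes the log contains lines of the form `At iteration N/N/N, Mean rms=..., delta=..., char train=X%, word train=Y%, ...`, and the function names are hypothetical.

```python
import re

# Assumed log line shape:
# "At iteration 100/100/100, Mean rms=..., delta=..., char train=12.3%, word train=..."
CHAR_TRAIN = re.compile(r"At iteration (\d+)/\d+/\d+,.*?char train=([\d.]+)%")


def char_train_curve(log_text):
    """Extract (iteration, char train error in percent) pairs from a training log."""
    points = []
    for line in log_text.splitlines():
        match = CHAR_TRAIN.search(line)
        if match:
            points.append((int(match.group(1)), float(match.group(2))))
    return points


def rising_steps(points):
    """Return consecutive pairs of points where the error went up."""
    return [(a, b) for a, b in zip(points, points[1:]) if b[1] > a[1]]
```

Occasional rises between consecutive reports are not unusual, since the figure is measured on recent training samples; a curve that stays flat or climbs over many thousands of iterations is the more telling sign.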

kba commented 6 years ago

It's probably best to ask on the tesseract-ocr mailing list. We only provide a frontend; your problem is caused by Tesseract itself.

If it does turn out to be a problem with how we lay out directories or with the defaults in the Makefile, we'd be happy to get a PR.

kba commented 6 years ago

https://groups.google.com/forum/#!topic/tesseract-ocr/qiQLb2-_QlE

jovargas commented 6 years ago

Well, it was me who opened that thread, hehe. Thank you @kba!
