tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

[Question] High error rate after training - why? #13

Closed: jovargas closed this issue 6 years ago

jovargas commented 6 years ago

I'm training with a new set of two fonts. The goal is to use Tesseract to recognize individual characters (only capital letters and digits), not entire words, but the results are far from decent and it looks like I'm doing something wrong. Tesseract and Leptonica were installed by the scripts.

Inspired by the test set provided in this repo, I created these .tif files, each with a matching .gt.txt ground-truth file:

From original binarized chars: image

From two TTFs to TIF images with random text: image image

At the end of the data creation process I have 1869 mixed text lines.
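A quick sanity check on the ground-truth set can rule out some data problems before training. The sketch below is only an illustration, not part of tesstrain: the function name `audit_ground_truth` is hypothetical, and it assumes every `NAME.gt.txt` file sits next to a `NAME.tif` image in one directory. It reports images that are missing, how often each character occurs, and any character outside the intended set of capital letters and digits.

```python
from collections import Counter
from pathlib import Path
import string

# Target character set described in the post: capital letters and digits.
ALLOWED = set(string.ascii_uppercase + string.digits)


def audit_ground_truth(gt_dir):
    """Return (missing_images, char_counts, unexpected_chars) for gt_dir.

    Assumes the layout NAME.tif + NAME.gt.txt in a single directory.
    """
    gt_dir = Path(gt_dir)
    missing = []
    counts = Counter()
    for gt_file in sorted(gt_dir.glob("*.gt.txt")):
        # Derive the expected image name by swapping the .gt.txt suffix.
        stem = gt_file.name[: -len(".gt.txt")]
        image = gt_file.with_name(stem + ".tif")
        if not image.exists():
            missing.append(image.name)
        text = gt_file.read_text(encoding="utf-8").strip()
        # Count every non-whitespace character in the transcription.
        counts.update(ch for ch in text if not ch.isspace())
    unexpected = sorted(set(counts) - ALLOWED)
    return missing, counts, unexpected
```

Characters that appear only a handful of times in `char_counts`, or anything listed in `unexpected_chars`, would be worth a look before blaming the number of iterations.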

First I ran the Makefile with the default 10000 iterations, but the best error rate was high. I assumed it just needed more iterations, so I raised the limit to 30000, but nothing improved. The following image shows the results: image

Sometimes I see the char train error rate go up instead of down. What am I missing here? Do I need more training data? Does my initial data violate some important requirement? I'd appreciate any help!
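For reference, a character error rate can also be measured independently of the training log by comparing recognized text against the ground truth. The sketch below uses one common definition (Levenshtein edit distance divided by the number of reference characters); that definition is an assumption here, and the figure Tesseract reports during training may be computed differently. The helper names are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(pairs):
    """CER over (ground_truth, recognized) pairs: total edits / total reference chars."""
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    return total_edits / total_chars if total_chars else 0.0


# One substituted character out of nine reference characters.
sample = [("ABC123", "A8C123"), ("XYZ", "XYZ")]
print(f"CER: {char_error_rate(sample):.3f}")
```

Running this on a held-out set of lines that were not used for training gives a number that can be compared across models, whatever the log says.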

kba commented 6 years ago

It's probably best to ask on the tesseract-ocr mailing list. We only provide a frontend; your problem is caused by Tesseract itself.

If it is indeed a problem with how we're laying out directories or defaults given in the Makefile, we'll be happy about a PR.

kba commented 6 years ago

https://groups.google.com/forum/#!topic/tesseract-ocr/qiQLb2-_QlE

jovargas commented 6 years ago

Well, it was me who opened that thread, hehe. Thank you @kba!