tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Will training on images that Tesseract can already OCR correctly help it OCR similar images with different content? #272

Closed AdnanCukur closed 2 years ago

AdnanCukur commented 2 years ago

My scans are pretty accurate right now, but every now and then something is recognized incorrectly (roughly a 0.1% error rate). I don't have many error images to train on, so my question is: will it help to also train on the images that it already OCRs correctly?

kba commented 2 years ago

If your scans are consistent, using the same font/script/language, training a task-specific model based on an existing model can indeed improve recognition. Whether it will be a significant improvement over the 99.9% accuracy remains to be seen, though; that is already very good recognition.
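To judge whether a fine-tuned model really beats the existing one, it helps to measure the error rate the same way before and after training. Here is a minimal sketch that computes a character error rate from pairs of ground-truth text and OCR output. The function names are hypothetical and not part of tesstrain; the metric is plain edit distance divided by the number of ground-truth characters.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(pairs) -> float:
    """pairs: iterable of (ground_truth, ocr_output) strings."""
    pairs = list(pairs)
    edits = sum(levenshtein(gt, ocr) for gt, ocr in pairs)
    chars = sum(len(gt) for gt, _ in pairs)
    return edits / chars if chars else 0.0
```

Running this on a held-out set of lines that were not used for training, once with the base model's output and once with the fine-tuned model's output, gives two comparable numbers.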

But I would still go for it, because as a general rule, it's better to invest time/effort in better recognition than in handling errors afterwards.

Also, you don't want to train only on the erroneous images, because that would likely lead to overfitting. Instead, train with a representative sample that includes both "easy" and "hard" recognition problems.
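One way to assemble such a mixed sample is sketched below, assuming you already have two lists of line-image names: ones the current model reads correctly (`easy`) and ones it gets wrong (`hard`). The function name and the `max_hard_fraction` knob are hypothetical, chosen only for illustration. The cap keeps the few error images from dominating the training set.

```python
import random


def build_training_sample(easy, hard, size, max_hard_fraction=0.25, seed=0):
    """Pick up to `size` items, with at most `max_hard_fraction` of them hard."""
    easy, hard = list(easy), list(hard)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    # Cap the hard examples so the sample is not made of errors only.
    n_hard = min(len(hard), int(size * max_hard_fraction))
    # Fill the rest with examples the model already recognizes correctly.
    n_easy = min(len(easy), size - n_hard)
    sample = rng.sample(hard, n_hard) + rng.sample(easy, n_easy)
    rng.shuffle(sample)
    return sample


easy_lines = [f"easy_{i:03d}" for i in range(100)]
hard_lines = ["hard_000", "hard_001", "hard_002"]
sample = build_training_sample(easy_lines, hard_lines, size=20)
```

With only three hard lines available, all three end up in the sample and the remaining 17 slots are filled with easy lines. If the goal is a sample that mirrors real-world frequencies, a plain uniform draw from all lines would be the simpler choice.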

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
