Tesseract trained on handwriting data - when being tested it only outputs the letter "e"

KPLinux commented 4 months ago

Trained using this dataset: https://www.kaggle.com/datasets/nibinv23/iam-handwriting-word-database/data

I am building an OCR model to recognize human handwriting. I extracted the image files, created a separate ground truth text file for each image, and followed the training process explained in the README. The makefile ran without errors, apart from one corrupted file that could not be read; I deleted it and continued training, after which the process terminated successfully.
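
A quick sanity check of the ground truth folder can rule out data problems before retraining. The sketch below is only an illustration, not part of tesstrain: it assumes the layout described in the tesstrain README (each line image named `*.png` or `*.tif` with a transcription of the same name ending in `.gt.txt`), and the function name `check_ground_truth` is hypothetical. It reports unpaired files, empty or multi-line transcriptions, and the overall character counts, which would show whether a single character dominates the training text.

```python
from collections import Counter
from pathlib import Path

# Image extensions assumed from the tesstrain README; longer suffixes first
# so that "x.bin.png" is matched before the plain ".png" rule.
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")
TEXT_SUFFIX = ".gt.txt"


def image_stem(name):
    """Return the file name without its image suffix, or None if not an image."""
    for suffix in IMAGE_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return None


def check_ground_truth(folder):
    """Hypothetical helper: summarize image/transcription pairs in a folder."""
    folder = Path(folder)
    images = set()
    transcripts = {}
    for path in folder.iterdir():
        if path.name.endswith(TEXT_SUFFIX):
            transcripts[path.name[: -len(TEXT_SUFFIX)]] = path
        else:
            stem = image_stem(path.name)
            if stem is not None:
                images.add(stem)

    empty, multiline = [], []
    chars = Counter()
    for stem, path in sorted(transcripts.items()):
        text = path.read_text(encoding="utf-8").strip()
        if not text:
            empty.append(stem)
        elif "\n" in text:
            multiline.append(stem)
        # Count every character except line breaks.
        chars.update(text.replace("\n", ""))

    return {
        "missing_text": sorted(images - set(transcripts)),
        "missing_image": sorted(set(transcripts) - images),
        "empty": empty,
        "multiline": multiline,
        "chars": chars,
    }
```

Calling `check_ground_truth("data/handwriting-ground-truth")` (the folder name is an assumption) and printing `report["chars"].most_common(10)` gives a rough view of the character distribution the model was trained on.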

However, when I tested Tesseract on some handwriting samples, it always returned "e". To rule out a problem with the engine itself, I ran the same images with the English traineddata: the results were gibberish, but each image produced a different output. My trained model, by contrast, outputs only "e" for every input image.

Is there a way to fix this? I am quite new to AI, ML, and programming in general, so any tips or suggestions would be appreciated.

stweil commented 4 months ago

Tesseract's layout detection is not able to separate the text lines in handwritten text. It was only designed for printed text. Therefore your newly trained model would work for single lines (with the right --psm parameter), but not for typical handwritten text.

Use kraken or other software that supports handwritten text recognition.

KPLinux commented 4 months ago

Thanks for the info. The model I am training is only meant to detect single lines of text, not multi-line sentences or paragraphs, so I thought tesstrain would help me train a custom Tesseract model for this purpose.

stweil commented 4 months ago

Then try --psm 7 or --psm 13 and pass the line image to Tesseract.
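
As a minimal sketch of that suggestion, the snippet below builds the Tesseract command line for a single line image and runs it. The flags `--tessdata-dir`, `-l`, and `--psm` are standard Tesseract CLI options; the function names and the example file, model, and directory names are assumptions for illustration only. Page segmentation mode 7 treats the image as a single text line, and mode 13 treats it as a raw line.

```python
import subprocess


def build_command(image_path, model, tessdata_dir, psm):
    """Build the tesseract invocation; 'stdout' sends the text to standard output."""
    return [
        "tesseract",
        str(image_path),
        "stdout",
        "--tessdata-dir", str(tessdata_dir),  # folder containing <model>.traineddata
        "-l", model,
        "--psm", str(psm),
    ]


def recognize(image_path, model, tessdata_dir, psm=7):
    """Run tesseract on one line image and return the recognized text."""
    result = subprocess.run(
        build_command(image_path, model, tessdata_dir, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return result.stdout.strip()
```

For example, `recognize("line.png", "handwriting", "data", psm=7)` and the same call with `psm=13` would show whether either mode changes the output; `line.png`, `handwriting`, and `data` are placeholder names.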

KPLinux commented 4 months ago

I tried both; they still return "e" and nothing else. I guess I'll have to learn how to use kraken, unless there's some other way to go about this. Either way, thanks for your help.

tesseract-ocr / tesstrain, issue #395