Training on real world images

mikylucky commented 5 years ago

Hi there, I need to train on real-world images.

I'm already preprocessing them, but there will always be some "dirt" on them.

Can tesstrain also learn to read partially covered characters?

stweil commented 5 years ago

I suggest trying it (and reporting the result here).

wrznr commented 5 years ago

Or post some examples for a wild guess from experienced users. :smile:

mikylucky commented 5 years ago

This is an example of the kind of image I need to train on. I'm interested only in the biggest number/text. At the moment I'm leaving the preprocessing and the binarization to Tesseract.

[Image: _SW50581_1570362338_2_gray]

At the moment I'm running some tests (mostly to understand how tesstrain works) with 528 ground truth images, and I'm gradually adding more. The dataset is highly variable, with many different fonts and with partially covered, deformed and rotated text.

Do you have any suggestions on how to improve training performance on this kind of image?

wrznr commented 5 years ago

First of all, Tesseract's recognition procedure works at the line level. Feeding it images like the one you posted won't work out of the box. You have to split them into lines a) for training and b) at run time. In addition, it is usually very helpful to deskew and/or dewarp (https://wrznr.github.io/IT-Kolloquium-2019/#29) the images. I think your specific application is mainly a preprocessing challenge and not so much an OCR one. You may want to have a look at https://scantailor.org/, https://github.com/Flameeyes/unpaper and https://github.com/ZaMaZaN4iK/PRLib/.

mikylucky commented 5 years ago

@wrznr thanks for the great resources!

For my purposes, the image I provided is a single line. I'm not interested in the small text; I consider it noise.

Until now I have focused on preprocessing: I started from very low Tesseract accuracy and have reached something like 40% accuracy at the moment. The preprocessing that works best so far locates the digits inside the image and crops them.

I've tried many strategies at the image level (binarization, noise removal, text extraction with EAST, Watershed and so on), but none of them outperformed Tesseract's internal preprocessing.

I also tried cleaning some images manually, but Tesseract failed to detect the right text, probably because of the font.

This is why I decided to proceed with custom Tesseract training.

As an update, at the moment I have a ground truth dataset of roughly 700 images, and after 500,000 iterations the model starts improving on a separate validation set. Training gave me a 36% error rate, validation gave me a 77% error rate. But the previous attempt had a 100% error rate on validation :D
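For readers unsure what error rates like these measure: a character error rate is commonly computed as the edit distance between the recognized text and the ground truth, divided by the ground truth length. The following is a minimal sketch under that assumption. `levenshtein` and `char_error_rate` are hypothetical helper names, not part of tesstrain, and the figures tesstrain reports may be computed differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(pairs):
    """pairs: iterable of (ground_truth, recognized) strings.

    Returns total edit distance divided by total ground truth length.
    """
    errors = 0
    total = 0
    for truth, recognized in pairs:
        errors += levenshtein(truth, recognized)
        total += len(truth)
    return errors / total if total else 0.0
```

Running the same function over the training lines and over a held-out validation set gives two numbers that can be compared directly, which is one way to see a gap like the one reported here.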

I'll keep you updated on the progress. I just launched a new training run for 1 million iterations, let's see what happens.

Does all of this make sense to you?

wrznr commented 5 years ago

If you make sure that you use the same page segmentation mode (either 7 or 13) at run time, it might work for some examples. But I do not think it is the right approach to try to teach the character recognition to perform image cleanup, and that is what your setup looks like. Keep in mind how character recognition with recurrent networks works: the line image is scaled to a uniform height and then split into many columns of binary values (i.e. black and white pixels). These columns (technically vectors) are the atoms on which the sequence classification is based. You train the relation between sequences of those binary-valued columns and the characters in the corresponding text files. But this relation is not consistent in your setup, since the lines above and below the actual target are subject to high variance. That is, the classifier will frequently stumble upon events (columns) it has not seen during training and thus does not “know”. The confidences will be very small, which effectively leads into the realm of guessing. More iterations won't help you with that problem. They may even worsen the situation, since you are asserting a relation between image and text that is not real. But of course, this is just speculation based on my experience in machine learning. If you are lucky, the model may learn the difference between reliable and unreliable cells in the columns.
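The column view described above can be illustrated with a small sketch. This is not Tesseract's actual code: it assumes a grayscale image given as a list of pixel rows, uses a plain nearest-neighbour resize, and applies a fixed threshold of 128. The function names and the threshold are hypothetical choices made only for illustration.

```python
def scale_to_height(image, target_height):
    """Nearest-neighbour resize to target_height, keeping the aspect ratio.

    image is a list of rows, each row a list of grayscale values (0-255).
    """
    src_h = len(image)
    src_w = len(image[0])
    target_width = max(1, round(src_w * target_height / src_h))
    return [
        [image[y * src_h // target_height][x * src_w // target_width]
         for x in range(target_width)]
        for y in range(target_height)
    ]


def to_binary_columns(image, threshold=128):
    """Return one tuple per pixel column; 1 = dark (ink), 0 = light."""
    height = len(image)
    width = len(image[0])
    return [
        tuple(1 if image[y][x] < threshold else 0 for y in range(height))
        for x in range(width)
    ]
```

Each tuple returned by `to_binary_columns` corresponds to one of the "atoms" in the explanation. If unrelated text sits above or below the target line, the upper and lower cells of every column vary from image to image, which is the inconsistency being pointed out.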

mikylucky commented 5 years ago

Thanks for the explanation. I indeed don't know the logic behind the training. I'm working blind at the moment because this is my first time with the topic, and I'm progressing by trial and error.

I'm already using PSM 13 as you suggested. Without custom training it worked pretty well; I think it was failing only because of the font.
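For reference, a run-time call with a fixed page segmentation mode could be wrapped as in the sketch below. It assumes the `tesseract` binary is on the PATH and uses its `--psm`, `-l` and `--tessdata-dir` options. The function names, the model name and the paths passed in are hypothetical, and restricting the mode to 7 or 13 simply mirrors the advice in this thread.

```python
import subprocess


def build_tesseract_command(image_path, model_name, tessdata_dir, psm=13):
    """Build the argument list for a single-line recognition run."""
    if psm not in (7, 13):
        # 7 = single text line, 13 = raw line; other modes are out of scope here.
        raise ValueError("this sketch only covers PSM 7 and 13")
    return [
        "tesseract", image_path, "stdout",
        "--tessdata-dir", tessdata_dir,
        "-l", model_name,
        "--psm", str(psm),
    ]


def recognize_line(image_path, model_name, tessdata_dir, psm=13):
    """Run Tesseract and return the recognized text (needs the tesseract binary)."""
    command = build_tesseract_command(image_path, model_name, tessdata_dir, psm)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Keeping the mode in one place makes it harder to evaluate a model with a different segmentation mode than the one used elsewhere in the pipeline.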

I'll try increasing the ground truth dataset to see whether performance improves. If not, as you predict, I'll try training on preprocessed images.

You talked about binary vectors, but I'm providing grayscale images as requested in the readme. Should I binarize the images myself before training?

wrznr commented 5 years ago

Not necessary to binarize as far as I know. @stweil ?

mikylucky commented 5 years ago

As an update, I tried feeding the unpreprocessed validation dataset to the 1-million-iteration model and got a huge performance boost.

Preprocessing + digits.traineddata = 42% accuracy
Dirty image + custom.traineddata = 46% accuracy

So it seems the model is also learning how to clean the images!

I'm trying to understand when the model will start to overfit, and meanwhile I'm increasing the training dataset.
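One common way to spot the onset of overfitting is to record the validation error at each saved checkpoint and look for the point after which it stops improving while the training error keeps falling. The sketch below assumes such a history has been collected by hand as `(iterations, train_error, validation_error)` tuples. The helper names and the `patience` parameter are hypothetical and not part of tesstrain.

```python
def best_checkpoint(history):
    """Return the entry with the lowest validation error.

    history: list of (iterations, train_error, validation_error) tuples.
    Ties go to the earliest checkpoint.
    """
    return min(history, key=lambda entry: (entry[2], entry[0]))


def overfitting_starts_at(history, patience=2):
    """Return the iteration count of the best checkpoint once validation
    error has failed to improve for `patience` consecutive checkpoints,
    or None if that never happens."""
    best = float("inf")
    best_iterations = None
    stale = 0
    for iterations, _train_error, validation_error in history:
        if validation_error < best:
            best = validation_error
            best_iterations = iterations
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return best_iterations
    return None
```

A `None` result means the validation error was still improving at the last checkpoint, so longer training or more data may still help.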

mikylucky commented 5 years ago

I can confirm it is improving, so I'll close this one.

nebiyebln commented 3 years ago

Hello, I am building a license plate recognition system and, like you, I'm working with real-world data. Some letters and numbers are unreadable or misread. Like you, I need to increase the accuracy, but I couldn't find the right resources. I tried training Tesseract, but only with clean images. Can you help me, please? @mikylucky

nebiyebln commented 3 years ago

@mikylucky please help me

tesseract-ocr / tesstrain

Training on real world images #104