OCR font recognition - Githubissues

Silex commented 10 years ago

Hello,

I used train-ocr to teach it the Swiss fonts and added the result to the eu dataset. When running some tests with ocr debug set to 1, I was surprised to see that the font used was still the "germany" one, even though I was analyzing a Swiss plate. The result for the plate is correct, though.

The code showing the font is:

const char* fontName = ri->WordFontAttributes(&dontcare, &dontcare, &dontcare, &dontcare, &dontcare, &dontcare, &pointsize, &fontindex);

Do you have a short explanation of how Tesseract is used and how it chooses which trained font to use? Does it try all of them and pick the best match?

matthill commented 10 years ago

I'm actually not entirely sure how Tesseract decides which font to use. I assume it's just the best match among what's available.

For similar-looking characters, Tesseract will sometimes pick the character from a different font. When the characters are totally different (e.g., a Virginia 'L' versus a New York 'L'), it tends to pick the right one.

Currently, the characters are processed individually. So it is likely that some characters will be recognized in one font and others in a different one for the same plate.

I also noticed that removing a font or two from the training set has actually increased my benchmark accuracy in some cases.

Silex commented 10 years ago

Yeah, if I use only the Swiss font then it detects Swiss plates with almost 100% accuracy, but French plates come out as complete garbage. If I add the Swiss font to the eu fonts then French plates are somewhat accurate and Swiss plates are often accurate, but it insists on using the German font to match the Swiss plate, resulting in O instead of 0.

It'd be nice to be able to specify a font order, or maybe to expose the confidence/font per character, or maybe to enforce some logic (e.g. if 80% of the characters match the Swiss font, then make them all Swiss). If you have more information about this, it is welcome. I'll try to investigate how Tesseract works for this (lots of new things to learn hehe).
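To make the "80% rule" above concrete, here is a minimal sketch of the majority check. It assumes the font name of each recognized character is already available; the `CharResult` struct and the `dominantFont` function are hypothetical names for illustration and are not part of OpenALPR or Tesseract.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-character OCR result: the recognized letter plus the
// name of the trained font it was matched against.
struct CharResult {
    char letter;
    std::string font;
};

// Returns the font that accounts for at least `thresholdPercent` of the
// characters, or an empty string when no font is that dominant.
// The threshold must be above 50 so that at most one font can qualify.
std::string dominantFont(const std::vector<CharResult>& chars, int thresholdPercent) {
    assert(thresholdPercent > 50 && thresholdPercent <= 100);

    // Count how many characters were matched against each font.
    std::map<std::string, std::size_t> counts;
    for (const CharResult& c : chars) {
        ++counts[c.font];
    }

    // Integer arithmetic avoids floating-point rounding at the boundary.
    for (const auto& entry : counts) {
        if (entry.second * 100 >= static_cast<std::size_t>(thresholdPercent) * chars.size()) {
            return entry.first;
        }
    }
    return "";
}
```

When a dominant font is found, the characters recognized in other fonts would be the ones to re-evaluate. How to force that re-evaluation is left open here.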

matthill commented 10 years ago

The code in ocr.cpp does have information about the source font available. I'm not sure what would happen if we used this data -- maybe accuracy would be improved, but perhaps not.

I think one way of doing it would be to pass the font information (as an index) into the postprocessor. If all the characters for a particular permutation of letters come from the same font, then that permutation gets a bonus added to its score.
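A rough sketch of the bonus idea, under the assumption that a permutation's score is the sum of its per-character scores. The `Candidate` struct, the `scorePermutation` function, and the bonus value are all hypothetical and do not reflect the actual postprocessor code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical candidate character: the letter, the index of the font it
// was recognized in, and its OCR score.
struct Candidate {
    char letter;
    int fontIndex;
    float score;
};

// Sums the per-character scores of one permutation and adds
// `sameFontBonus` when every character comes from the same font.
float scorePermutation(const std::vector<Candidate>& perm, float sameFontBonus) {
    assert(sameFontBonus >= 0.0f);

    float total = 0.0f;
    bool sameFont = !perm.empty();
    for (const Candidate& c : perm) {
        total += c.score;
        if (c.fontIndex != perm.front().fontIndex) {
            sameFont = false;
        }
    }
    return sameFont ? total + sameFontBonus : total;
}
```

With a bonus of 5, a mixed-font reading scoring 90 + 80 would lose to a single-font reading scoring 88 + 80, which is the effect wanted for the O versus 0 case.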

The only wrinkle with this is that the postprocess function prunes many characters out based on the topN value. Rather than calculate every possible permutation of characters, it removes the lower scoring characters if they could not possibly be in the TopN. Without this feature, the postprocessing time can (in some cases) be extremely slow (hundreds of milliseconds). With this added, postprocess never takes more than a millisecond or so.

If we added a bonus score, it's possible that some characters that would make the topN once the bonus is included might get pruned too soon. This may not be that big of a deal, though.
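One way to picture the pruning concern is to include the largest possible bonus in the bound before discarding a character. The sketch below is an illustration only, not the actual pruning logic in the postprocessor; `canPrune` and all of its parameters are hypothetical, and it again assumes a permutation's score is a sum of per-character scores.

```cpp
#include <cassert>

// Hypothetical pruning test.
// candidateScore:  score of the character under consideration.
// bestOthersScore: sum of the best scores at every other position.
// nthBestScore:    score of the weakest permutation currently in the top N.
// maxBonus:        largest bonus a permutation could still earn (0 for none).
// Returns true when even the best permutation containing this character
// cannot reach the top N, so the character is safe to drop.
bool canPrune(float candidateScore, float bestOthersScore,
              float nthBestScore, float maxBonus) {
    assert(maxBonus >= 0.0f);
    float bestPossible = candidateScore + bestOthersScore + maxBonus;
    return bestPossible < nthBestScore;
}
```

For example, a character scoring 60 next to a best remainder of 100 is pruned against a cutoff of 163 when no bonus is considered, but kept when a bonus of up to 5 is allowed for. A larger bonus means fewer characters are pruned, so postprocessing time would likely go up.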

mohit-surana commented 8 years ago

Hi @Silex, I am interested in working on implementing this feature. Could you share the dataset you used to train on the Swiss plates, or any other guidelines/strategy that I could follow to effectively test this enhancement?

Thanks, Mohit.

Silex commented 8 years ago

@doodhwala: hum, sorry, it was a long time ago... but I found most of my plates by searching Google Images for "swiss plates" or "switzerland plates".

mohit-surana commented 8 years ago

@Silex I have added the bonus score and, as @matthill mentioned, a lot of results were pruned, but the final result seemed pretty accurate. I still need to test with negative samples.

openalpr / openalpr

OCR font recognition #22