LSTM: Non-dictionary words with combination of letters and numbers not recognized.

tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)

https://tesseract-ocr.github.io/

Apache License 2.0


LSTM: Non-dictionary words with combination of letters and numbers not recognized. #733

Open Shreeshrii opened 7 years ago

Shreeshrii commented 7 years ago

https://groups.google.com/d/msgid/tesseract-ocr/1a3e8773-7151-48f9-92bb-fda888293eab%40googlegroups.com?utm_medium=email&utm_source=footer

While the single "S" is recognized correctly, the text "2S" is recognized as "25".

Here is link to the test image:

https://03054610326450256607.googlegroups.com/attach/b8b86693ac072/2s.png?part=0.4&view=1

Shreeshrii commented 7 years ago

On 22-Feb-2017 9:02 PM, "Amit D." notifications@github.com wrote:

The LSTM engine is trained on text-line images and learns from context, so it does not surprise me that OCR accuracy is poor for a single glyph.

So, is this another case where the legacy engine is better than LSTM?


andrewisplinghoff commented 7 years ago

Yes, the legacy engine (--oem 0) gets this one right.

tesseract4 --psm 7 --oem 0 2s.png 2s-out-oem0-psm7.txt

2s-out-oem0-psm7.txt
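For anyone who wants to reproduce this comparison, here is a minimal Python sketch that runs the same image through both engines. The helper names `build_cmd` and `run_both`, the output base names, and the default binary name `tesseract` are assumptions made for illustration (the command above uses a binary called `tesseract4`). `--oem 1` selects the LSTM engine; `--oem 0` only works when the installed traineddata file includes the legacy model.

```python
import os
import shutil
import subprocess


def build_cmd(image, outbase, oem, psm=7, binary="tesseract"):
    """Return the argument list for one Tesseract run (hypothetical helper)."""
    return [binary, image, outbase, "--oem", str(oem), "--psm", str(psm)]


def run_both(image, binary="tesseract"):
    """Run the legacy (--oem 0) and LSTM (--oem 1) engines on one image."""
    results = {}
    for oem in (0, 1):
        outbase = f"out-oem{oem}"  # assumed output base name
        subprocess.run(build_cmd(image, outbase, oem, binary=binary), check=True)
        # Tesseract appends ".txt" to the output base name.
        with open(outbase + ".txt", encoding="utf-8") as fh:
            results[oem] = fh.read().strip()
    return results


if __name__ == "__main__":
    # Only attempt a real run when both the binary and the test image exist.
    if shutil.which("tesseract") and os.path.exists("2s.png"):
        print(run_both("2s.png"))
```

Keeping `--psm 7` (treat the image as a single text line) fixed for both runs means that only the engine differs between the two outputs.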

Shreeshrii commented 6 years ago

@zdenop Please add the label: accuracy.

Shreeshrii commented 6 years ago

Another instance was reported in the forum, in the context of recognizing license plates.

Please see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/tesseract-ocr/qxB-aCa3r6E

Test image is

minus-4l

Shreeshrii commented 6 years ago

numbers-dawg has patterns of numbers with punctuation and letters. However, there is currently no way to specify patterns for things such as license plates, VINs, and product IDs, which are non-dictionary words made up of random combinations of numbers and letters.

Here are the other two images from error reports:

minus-0o

@theraysmith

Is there a variable which can be set for better accuracy in such cases?

Shreeshrii commented 6 years ago

Another issue, reported in the forum

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/tesseract-ocr/6a6sKOXdZsA

After a run of digits, "I" is read as "1" and "A" is read as "4":

- an image containing "12345678I" => `123456781`
- an image containing "GLOTHUVFI" => `GLOTHUVFI`
- an image containing "12345678H" => `12345678H`
- an image containing "GLOTHUVFH" => `GLOTHUVFH`
- an image containing "12345678A" => `123456784`
- an image containing "GLOTHUVFA" => `GLOTHUVFA`
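
The cases above can be turned into a small regression check. The following Python sketch tallies which characters were substituted. The pairs are copied from the list, while the function name `confusions` and the position-by-position comparison are assumptions made for illustration, not part of the original report.

```python
from collections import Counter

# (expected text, recognized text) pairs copied from the list above.
CASES = [
    ("12345678I", "123456781"),
    ("GLOTHUVFI", "GLOTHUVFI"),
    ("12345678H", "12345678H"),
    ("GLOTHUVFH", "GLOTHUVFH"),
    ("12345678A", "123456784"),
    ("GLOTHUVFA", "GLOTHUVFA"),
]


def confusions(cases):
    """Count (expected_char, recognized_char) substitutions by position."""
    counts = Counter()
    for expected, got in cases:
        if len(expected) != len(got):
            continue  # a positional comparison needs equal-length strings
        for e, g in zip(expected, got):
            if e != g:
                counts[(e, g)] += 1
    return counts


if __name__ == "__main__":
    for (e, g), n in confusions(CASES).items():
        print(f"{e!r} read as {g!r}: {n} time(s)")
```

On this data, the only substitutions are "I" to "1" and "A" to "4", and both occur only when the letter follows digits. The same letters after other letters are read correctly.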

kolakao commented 5 years ago

Unfortunately, I've fallen into the same pit. Is there any solution yet? I think I've tried everything, and all the threads I can find on this topic have been left without a solution.

FrancescoSaverioZuppichini commented 4 years ago

Same problem here

ghost commented 2 years ago

Hello, do you have datasets somewhere available for testing?

SHANDLEMAN commented 2 years ago

This thread has been open for 5 years. Has anyone come up with a method for reliably getting tesseract to read a combination of letters and numbers?