LSTM engine needs more training

tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)

https://tesseract-ocr.github.io/

Apache License 2.0


LSTM engine needs more training #2221

Open YamashitaRen opened 5 years ago

YamashitaRen commented 5 years ago

According to the Tesseract 4.0.0 Release Notes:

Added a new OCR engine that uses neural network system based on LSTMs, with major accuracy gains.

My testing with this new OCR engine does not corroborate this statement. The results I get are vastly inferior to the ones I was getting with the legacy engine. (As in, when the two outputs differ, the legacy engine's result is WAY better 99 times out of 100.)

Example: the output of Tesseract v4.0.0 when used on this picture with the tessdata_best fra.traineddata is "ROUTIER".

This issue seems critical, as the LSTM engine is now the default and distros like Ubuntu 18.04 no longer provide traineddata that includes the legacy models.

dagnelies commented 5 years ago

From a brief test, it appears the following models are "broken":

Only this one appears to produce meaningful results:

https://github.com/tesseract-ocr/tessdata_fast/raw/master/fra.traineddata

Shreeshrii commented 5 years ago

@dagnelies

Please share the test image, command used and error received to identify what is broken.

dagnelies commented 5 years ago

Hi @Shreeshrii

I used the same image as the OP ( https://user-images.githubusercontent.com/4103637/52183232-3d9b1800-2806-11e9-831d-25aa090eab34.jpg )

The first two traineddata resulted in "ROUTIER" (one of them with an additional whitespace).

The tessdata_fast resulted in "ce jour-là," which is correct.

Dunno what is going on with the other two traineddata files, or whether I missed something, but they seem to produce nonsense.

Shreeshrii commented 5 years ago

The test image is white text on a black background. If you invert it, all three traineddata files give the correct result.

Since tessdata_fast produces the correct result in both cases, it seems to have been trained with samples of white text on a black background too.

$ convert fra-black.jpg -negate fra.jpg

$ tesseract fra.jpg stdout -l fra --oem 1 --tessdata-dir ~/tessdata_fast
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 322
ce jour-là,
$ tesseract fra.jpg stdout -l fra --oem 1 --tessdata-dir ~/tessdata_best
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 322
ce jour-là,
$ tesseract fra.jpg stdout -l fra --oem 1 --tessdata-dir ~/tessdata
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 322
ce jour-là,
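
The inversion step above can be made conditional, so that images that are already dark-on-light are left alone. The following is a minimal sketch and not part of Tesseract: the function names, the list-of-rows pixel format, and the mean-brightness threshold of 128 are all assumptions chosen for illustration.

```python
def looks_light_on_dark(pixels, threshold=128):
    """Guess whether a grayscale image has light text on a dark background.

    pixels: rows of 0-255 grayscale values (hypothetical input format).
    Text usually covers far less area than the background, so a low mean
    brightness suggests a dark background. This is only a heuristic.
    """
    values = [v for row in pixels for v in row]
    if not values:
        return False
    return sum(values) / len(values) < threshold


def negate(pixels):
    """Invert 8-bit grayscale values, as `convert -negate` does for such an image."""
    return [[255 - v for v in row] for row in pixels]


def normalize_polarity(pixels, threshold=128):
    """Return the image as dark-on-light, inverting only when needed."""
    if looks_light_on_dark(pixels, threshold):
        return negate(pixels)
    return pixels
```

A mean-brightness test can misfire on images with large bright or dark regions unrelated to the text, so the threshold should be treated as a starting point rather than a rule.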

YamashitaRen commented 5 years ago

Nice find @Shreeshrii !

According to my testing, LSTM is still worse than Legacy. It seems to always (?) miss ellipses and italics.

Ellipses:

$ tesseract 00h03m04s200-00h03m07s920.jpg stdout -l fra --oem 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 387
Je n'ai jamais entendu parler
d'un tel lieu.

Italics:

$ tesseract 00h05m24s400-00h05m26s480.jpg stdout -l fra --oem 1 hocr > output.txt
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 464

output.txt

Apart from these, if you disregard how MUCH longer the processing takes, it seems to be on par with the Legacy engine.
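
For anyone who wants to repeat this kind of side-by-side comparison, here is a rough sketch that builds one `tesseract` command line per engine for the same image. The helper names are hypothetical. The flags are the ones used earlier in this thread, with `--oem 0` selecting the legacy engine and `--oem 1` the LSTM engine. It assumes the traineddata in use actually contains a legacy model, which, as noted above, is not the case for every distribution package.

```python
def tesseract_cmd(image, lang="fra", oem=1, tessdata_dir=None):
    """Build an argv list for the tesseract CLI, writing recognized text to stdout."""
    cmd = ["tesseract", image, "stdout", "-l", lang, "--oem", str(oem)]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def engine_comparison_cmds(image, lang="fra", tessdata_dir=None):
    """Return one command per engine for the same image (0 = legacy, 1 = LSTM)."""
    return {
        "legacy": tesseract_cmd(image, lang, 0, tessdata_dir),
        "lstm": tesseract_cmd(image, lang, 1, tessdata_dir),
    }
```

Each list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` and the two `stdout` values compared; timing each call would also put a number on the speed difference.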