tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

numbers are not extracted while using "-l heb+eng" but are extracted when using "-l eng" alone #1102

Closed by amitm02 6 years ago

amitm02 commented 7 years ago

Attempting to OCR the following document: link

Perhaps a bug related to RTL languages?

Tesseract version: 4.00.00alpha. Platform: Darwin Kernel Version 16.7.0 (OS X)

amitdo commented 7 years ago

Hi Amit!

It works for me with -l best/heb

amitm02 commented 7 years ago

Thanks @amitdo! I'm not familiar with "best", what is it? Is there a documentation reference you can point me to?

amitdo commented 7 years ago

Even better: -l best/Hebrew

The Hebrew.traineddata was trained for Hebrew and English.

amitdo commented 7 years ago

https://github.com/tesseract-ocr/tessdata/tree/master/best

You can create a 'best' folder in your 'tessdata' folder and put the traineddata files you need from the above link there.

Then you can use -l best/heb

amitm02 commented 7 years ago

What is the difference between the "best/heb" traineddata and the regular "heb" traineddata?

amitdo commented 7 years ago

What is the difference between "best/heb" and the regular "heb"?

When you use best/heb Tesseract uses /tessdatapath/tessdata/best/heb.traineddata. When you use heb it uses /tessdatapath/tessdata/heb.traineddata.

The best folder in https://github.com/tesseract-ocr/tessdata/tree/master contains the newest traineddata models.

You need to build Tesseract from the master branch.
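
The path mapping described above can be sketched in a few lines of Python. This is only an illustration of how a `-l` value corresponds to traineddata files, not Tesseract's actual lookup code; the function name `traineddata_paths` is hypothetical, and `/tessdatapath/tessdata` is the placeholder directory used in this thread.

```python
from pathlib import PurePosixPath
from typing import List


def traineddata_paths(lang_spec: str, tessdata_dir: str) -> List[str]:
    """Map a -l value such as 'best/heb' or 'heb+eng' to traineddata paths.

    Hypothetical helper: it mirrors the mapping described in the thread,
    not Tesseract's internal implementation.
    """
    base = PurePosixPath(tessdata_dir)
    # '+' separates multiple languages; each name is resolved relative to
    # the tessdata directory, so 'best/heb' points into the 'best' subfolder.
    return [
        str(base / f"{name}.traineddata")
        for name in lang_spec.split("+")
        if name
    ]


print(traineddata_paths("best/heb", "/tessdatapath/tessdata"))
print(traineddata_paths("heb+eng", "/tessdatapath/tessdata"))
```

Under this mapping, moving the newer files directly into the tessdata directory would make plain `heb` resolve to them.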

amitm02 commented 7 years ago

Got it. So /best contains the newer, better models. best/Hebrew is the newer trained data for both Hebrew and English. best/heb is the newer trained data for Hebrew only.

This is very helpful @amitdo, I couldn't find this information in the wiki.

amitdo commented 7 years ago

If you want, you can put the new files in /tessdatapath/tessdata and then use just -l heb or -l Hebrew.

amitm02 commented 7 years ago

Attached another example image: link. Even when using the best trained data, the problem persists. "-l best/eng" extracts the numbers. "-l best/Hebrew" does not extract the numbers, instead putting 0 where there should be a number.

amitdo commented 7 years ago

I don't know why it fails to read the numbers in the second example.

amitdo commented 7 years ago

If I erase the NIS, I get:

נזק ע"פ שמאי $450.00 שכ"ט שמאי 650.00

[Attached images: test, test10]

amitm02 commented 7 years ago

Hmm, strange indeed. It repeats in many similar examples. Hopefully a fix will be found.

ghost commented 7 years ago

I think you need to fine-tune the model (training an existing model on new data without changing any part of the network). Have a look at: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact

amitm02 commented 6 years ago

I was about to dive into fine-tuning the model, but as a last check I looked at the segmentation marks via the VietOCR UI front-end. My observations:

Hence, it seems to me that it is a software bug and not a network tweaking issue.

[Attached screenshots: 29920780-c7ddae28-8e57-11e7-8661-5b395f522851, screenshot 2017-11-28 11 58 07]

arsenhakobyan commented 6 years ago

Let me first ask a question, then try to explain how and why I came to it.

Could this issue be related to the trained data for the Hebrew language?

Debugging the Tesseract sources to find a place where correctly recognized digits might be replaced by 0 gave no result. Instead, I found that among the alternates to the "Best Choice" returned by Tesseract, there are no other choices for the incorrectly recognized digits. For example, the intermediate dumped alternate characters for the original image from this issue are:

Alternates for "ח"ש": {"ח"ש"}
Alternates for "0": {"0"}
Alternates for "יאמש": {"יאמש"}
Alternates for "פ"ע": {"פ"ע"}
Alternates for "קזנ": {"קזנ"}
Alternates for "ח"ש": {"ח"ש"}
Alternates for "0": {"0"}
Alternates for "יאמש": {"יאמש"}
Alternates for "ט"כש": {"ט"כש"}

Then I tried some experiments: I ran tesseract with '-l heb' and '-l Hebrew' (assuming that '-l Hebrew' is the version whose training data contains the English data as well). I also used different modifications of the input image, and some other input images as well.

Concerning modifications of the original image: first I tried with only the digits remaining in the image (modnumber).

The result for '-l Hebrew' is: 450.00 650.00

The result for '-l heb' is: 0 909000

Then I tried with added space between the Hebrew characters and the numbers (mod).

The result for '-l Hebrew' is: נזק ע"פ שמאי 450.00 ש"ח

שכ"ט שמאי 650.00 ש"ח

The result for '-l heb' is: נזק ע"פ שמאי 0 ש"ח

שכ"ט שמאי 9090000 ש"ח


Experiments with other images containing different numbers in combination with Hebrew characters show that '-l heb' does output digits as expected, and most of the time they are correct. That is why the most likely remaining cause of the reported issue (IMHO) is the Hebrew trained data used for '-l heb'.

Also, if someone knows other places where it would be useful to debug the recognized characters, to see whether they are initially correct and only later replaced by '0' or some other incorrect character, please let me know.
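
The comparison described above can be reproduced with a small script. This is a minimal sketch, assuming the `tesseract` command-line tool is on the PATH and both the `heb` and `Hebrew` traineddata files are installed; the helper names and the file name `mod.png` are placeholders, not part of the original experiments.

```python
import re
import subprocess
from typing import Dict, List


def tesseract_command(image_path: str, lang: str) -> List[str]:
    # Passing 'stdout' as the output base makes the tesseract CLI print
    # the recognized text instead of writing it to a file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def extract_numbers(text: str) -> List[str]:
    # Match integers or decimals such as '0' or '450.00'.
    return re.findall(r"\d+(?:\.\d+)?", text)


def compare_models(image_path: str, langs: List[str]) -> Dict[str, List[str]]:
    # Run the same image through each model and keep only the numbers,
    # so differences such as '450.00' versus '0' stand out.
    results = {}
    for lang in langs:
        completed = subprocess.run(
            tesseract_command(image_path, lang),
            capture_output=True,
            encoding="utf-8",
            check=True,
        )
        results[lang] = extract_numbers(completed.stdout)
    return results


# Example call (requires tesseract and a real image file):
# compare_models("mod.png", ["heb", "Hebrew"])
```

Comparing only the extracted numbers keeps the Hebrew text out of the diff, which makes it easier to spot the cases where one model emits 0 in place of the amount.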

amitm02 commented 6 years ago

It turns out it was indeed a training issue. After training, "heb" reads the numbers correctly. Bug can be closed.

amitdo commented 6 years ago

Bug can be closed.

You can close it yourself :-)

amitm02 commented 6 years ago

silly me :)