Closed amitm02 closed 6 years ago
Hi Amit!
It works for me with -l best/heb
Thanks @amitdo! I'm not familiar with "best", what is it? Is there a documentation reference you can point me to?
Even better: -l best/Hebrew
The Hebrew.traineddata was trained for Hebrew and English.
https://github.com/tesseract-ocr/tessdata/tree/master/best
You can create a 'best' folder in your 'tessdata' folder and put the traineddata files you need from the above link there.
Then you can use -l best/heb
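The setup described above could be scripted roughly as follows. This is only a sketch: it assumes the traineddata files have already been downloaded by hand into a local folder, and the function name `install_best_models` and its parameters are made up for illustration.

```python
import shutil
from pathlib import Path


def install_best_models(download_dir, tessdata_dir, names):
    """Copy downloaded <name>.traineddata files into <tessdata_dir>/best."""
    best_dir = Path(tessdata_dir) / "best"
    # Create the 'best' subfolder if it does not exist yet.
    best_dir.mkdir(parents=True, exist_ok=True)
    installed = []
    for name in names:
        src = Path(download_dir) / f"{name}.traineddata"
        dst = best_dir / src.name
        shutil.copy2(src, dst)
        installed.append(dst)
    return installed
```

After something like `install_best_models("downloads", "/tessdatapath/tessdata", ["heb", "Hebrew"])` (both paths are placeholders), the models should be selectable with -l best/heb and -l best/Hebrew.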
What is the difference between "best/heb" traindata and the regular "heb" traindata?
What is the difference between "best/heb" and the regular "heb"?
When you use best/heb Tesseract uses /tessdatapath/tessdata/best/heb.traineddata. When you use heb it uses /tessdatapath/tessdata/heb.traineddata.
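To make the lookup rule concrete, here is a small Python sketch of the mapping from the -l value to the file on disk. It only illustrates the rule as described in this comment and is not Tesseract's actual code; the helper name `traineddata_path` is hypothetical.

```python
from pathlib import PurePosixPath


def traineddata_path(tessdata_dir: str, lang: str) -> PurePosixPath:
    """Map a -l value such as 'best/heb' to the traineddata file it refers to."""
    # The language value is treated as a path relative to the tessdata folder.
    return PurePosixPath(tessdata_dir) / f"{lang}.traineddata"


print(traineddata_path("/tessdatapath/tessdata", "best/heb"))
print(traineddata_path("/tessdatapath/tessdata", "heb"))
```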
The best folder in https://github.com/tesseract-ocr/tessdata/tree/master contains the newest traineddata models.
You need to build Tesseract from the master branch.
Got it. So /best contains the newer, better models. best/Hebrew is the newer trained data for both Hebrew and English. best/heb is the newer trained data for Hebrew only.
This is very helpful @amitdo, I couldn't find this information in the wiki.
If you want, you can put the new files in /tessdatapath/tessdata and then use just -l heb or -l Hebrew.
Attached another example image: link. Even when using the best trained data, the problem persists. "-l best/eng" extracts the numbers. "-l best/Hebrew" does not extract the numbers, instead putting 0 where there should be a number.
I don't know why it fails to read the numbers in the second example.
If I erase the NIS, I get:
נזק ע"פ שמאי $450.00 שכ"ט שמאי 650.00
Hmm.. strange indeed. It repeats in many similar examples. Hopefully a fix will be found..
I think you need to fine-tune the model (train an existing model on new data without changing any part of the network). Have a look at: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact
I was about to dive into fine-tuning the model, but as a last check I looked at the segmentation marks via the VietOCR UI front-end. My observations:
Hence, it seems to me that it is a software bug and not a network tweaking issue.
Let me first ask a question then try to explain how/why I came to this question.
Can this issue be related to the trained data for the Hebrew language?
Debugging the Tesseract sources to find a place where correctly recognized digits might be replaced by 0 gave no result. Instead I found that among the alternates to the "Best Choice" returned by Tesseract there are no other choices for the incorrectly recognized digits. For example, the intermediate dumped alternate characters for the original image from this issue are:
Alternates for "ח"ש": {"ח"ש"}
Alternates for "0": {"0"}
Alternates for "יאמש": {"יאמש"}
Alternates for "פ"ע": {"פ"ע"}
Alternates for "קזנ": {"קזנ"}
Alternates for "ח"ש": {"ח"ש"}
Alternates for "0": {"0"}
Alternates for "יאמש": {"יאמש"}
Alternates for "ט"כש": {"ט"כש"}
Then I tried some experiments: I ran tesseract with '-l heb' and '-l Hebrew' (assuming that '-l Hebrew' is the version whose training data contains the English data as well). I also used different modifications of the input image, and some other input images as well.
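A comparison like this can be scripted. The sketch below assumes the tesseract executable is on the PATH and that both models are installed; the helper names and the image file name are placeholders, not something taken from this thread.

```python
import subprocess


def tesseract_command(image, lang):
    """Build the command line: tesseract <image> stdout -l <lang>."""
    # 'stdout' as the output base makes tesseract print the text
    # instead of writing it to a file.
    return ["tesseract", image, "stdout", "-l", lang]


def run_ocr(image, lang):
    """Run tesseract on one image with one language model and return the text."""
    result = subprocess.run(
        tesseract_command(image, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def compare_models(image, langs=("heb", "Hebrew")):
    """Return a dict mapping each language model to its OCR output."""
    return {lang: run_ocr(image, lang) for lang in langs}
```

Calling `compare_models("sample.png")` (with a real image in place of the placeholder name) would give the two outputs side by side.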
Concerning modifications of the original image: first I tried with only the digits remaining in the image.
The result for '-l Hebrew' is: 450.00 650.00
the result for '-l heb' is: 0 909000
Then I tried with added space between the Hebrew characters and the numbers.
The result for '-l Hebrew' is: נזק ע"פ שמאי 450.00 ש"ח
שכ"ט שמאי 650.00 ש"ח
the result for '-l heb' is: נזק ע"פ שמאי 0 ש"ח
שכ"ט שמאי 9090000 ש"ח
Experiments with other images containing different numbers in combination with Hebrew characters show that '-l heb' does output digits as expected, and most of the time they are correct. That is why the last thing that probably lies behind the reported issue (IMHO) is the Hebrew training data used for '-l heb'.
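One way to check many images for this symptom is to compare the numbers expected in an image against the numbers found in the OCR output. The following is a minimal sketch; the helper names are hypothetical, and the sample strings are the outputs quoted above.

```python
import re

# Matches integers and simple decimals such as 450.00.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numbers_in(text):
    """Return all numeric tokens found in the OCR output, in order."""
    return NUMBER.findall(text)


def missing_numbers(expected, ocr_text):
    """Return the expected numbers that do not appear in the OCR output."""
    found = numbers_in(ocr_text)
    return [n for n in expected if n not in found]


hebrew_output = 'נזק ע"פ שמאי 450.00 ש"ח\nשכ"ט שמאי 650.00 ש"ח'
heb_output = 'נזק ע"פ שמאי 0 ש"ח\nשכ"ט שמאי 9090000 ש"ח'

print(missing_numbers(["450.00", "650.00"], hebrew_output))
print(missing_numbers(["450.00", "650.00"], heb_output))
```

An empty list means every expected number was read; for the '-l heb' output above, both numbers are reported as missing.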
Also, if someone knows other places where it would be useful to debug the recognized characters, to see whether they are initially correct and only later replaced by '0' or some other incorrect character, please let me know.
It turns out it was indeed a training issue. After training, "heb" reads the numbers correctly. Bug can be closed.
You can close it yourself :-)
silly me :)
Attempting to OCR the following document: link
Perhaps a bug related to RTL languages?
Tesseract Version: 4.00.00alpha
Platform: Darwin Kernel Version 16.7.0 (OS X)