Open Sanaj2060 opened 5 years ago
Please provide a copy of the input tiff. Which traineddata file did you use? Which psm?
Bengali is ben+eng, but it should still not recognize Bangla as English.
@Shreeshrii Here I'm using a fine tuned traineddata. And psm is default. files.zip
What is the result if you use the official traineddata from tessdata_best? Please try with both ben and script/Bengali.
Using tessdata_best/ben, there are still some English words.
Using tessdata_best/script/Bengali, it seems there are no random English words.
I'm using a fine tuned traineddata
How did you fine-tune? Which traineddata did you use to start with? What was the error rate at the end of training?
I added some Manipuri words to ben.training_text (tesstrain.sh with Lohit Bengali and Bangla Medium) and started from tessdata_best/ben.traineddata.
After 100k iterations the error rate was 0.006.
Using tessdata_best/ben , there are still some English words.
So, the problem exists in the model you started with.
Please try your training using script/Bengali and see if that is any better.
Does your unicharset include English letters?
The config file is loading English as a sub-language.
tessedit_load_sublangs eng
Please remove that.
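As a rough illustration of that change, the sketch below strips the sub-language line from the text of a config component. It assumes the config has already been unpacked from the traineddata file (typically with combine_tessdata -u) and will be written back afterwards (typically with combine_tessdata -o). The function name drop_config_key and the sample config content are hypothetical, not taken from the files in this thread.

```python
def drop_config_key(config_text: str, key: str = "tessedit_load_sublangs") -> str:
    """Return config_text without any line that sets `key`."""
    kept = []
    for line in config_text.splitlines():
        parts = line.split()
        # A config line has the form "<name> <value>"; skip lines whose name matches.
        if parts and parts[0] == key:
            continue
        kept.append(line)
    result = "\n".join(kept)
    # Preserve the trailing newline only if something is left to terminate.
    if kept and config_text.endswith("\n"):
        result += "\n"
    return result


if __name__ == "__main__":
    # Hypothetical config content, for illustration only.
    sample = "tessedit_load_sublangs eng\npreserve_interword_spaces 1\n"
    print(drop_config_key(sample), end="")
```

With the line removed, Tesseract no longer tries to load eng.traineddata alongside the main model.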
If we removed the eng as sublangs, will it still recognise the English words in the input file.
And regarding fine tuning using script/Bengali: There is a letter ৰ which is not used in Manipuri language. So, how can I remove or ignore the letter ৰ from the model?
If we removed the eng as sublangs, will it still recognise the English words in the input file.
No, it won't recognize English.
You can try using ben+eng if you need English.
There is a letter ৰ which is not used in Manipuri language. So, how can I remove or ignore the letter ৰ from the model?
Don't include it in your training text. Make sure it is not there in the unicharset.
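A minimal sketch of how both checks could be automated, assuming the unicharset is a text file whose first line is an entry count and whose remaining lines each start with the unichar followed by whitespace-separated properties. The helper names unicharset_has and lines_with are hypothetical, and the sample strings are simplified stand-ins for real files.

```python
from typing import List

UNWANTED = "\u09F0"  # the letter ৰ discussed above


def unicharset_has(unicharset_text: str, ch: str) -> bool:
    """True if any unicharset entry contains `ch` (entries may be multi-codepoint)."""
    lines = unicharset_text.splitlines()
    for line in lines[1:]:  # assumption: the first line holds the entry count
        parts = line.split()
        if parts and ch in parts[0]:
            return True
    return False


def lines_with(training_text: str, ch: str) -> List[int]:
    """Return 1-based line numbers of the training text that contain `ch`."""
    return [
        number
        for number, line in enumerate(training_text.splitlines(), start=1)
        if ch in line
    ]


if __name__ == "__main__":
    # Simplified, made-up samples; real unicharset lines carry more properties.
    sample_unicharset = "3\nNULL 0\n\u0995 1\n\u09F0 1\n"
    sample_training = "\u0995\u0995\n\u09F0\u0995\n"
    print(unicharset_has(sample_unicharset, UNWANTED))
    print(lines_with(sample_training, UNWANTED))
```

Any line numbers reported for the training text would need to be edited or dropped before regenerating the unicharset.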
I'm using a fine tuned traineddata. And psm is default.
Try with --psm 6 to see if there is any difference.
No English words without the sub language.
tesseract ref.tiff ref-man_1 -l ben_man_1 --tessdata-dir ./
Error opening data file ./eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract Open Source OCR Engine v5.0.0-alpha-197-g65221 with Leptonica
Page 1
Page 2
Page 3
Page 4
Page 5
Page 6
Page 7
Page 8
Page 9
Page 10
Accuracy using ocrevalutf8 accuracy correct.txt ocr.txt:
ben_fine_tuned.traineddata: accuracy 92.36% (--psm 3)
tessdata_best/script/Bengali.traineddata: accuracy 92.41% (--psm 3)
ben_fine_tuned.traineddata: accuracy 92.76% (--psm 6)
tessdata_best/script/Bengali.traineddata: accuracy 92.83% (--psm 6)
In ref-man_1.txt, line 597:
চাদা ৭০গী চাংদা ভোট থাদনখি ভােট মশিং থিবগী থবক মে ১৫দা পাঙথোক্কদবনি 13 149) 201] 8 থৌদাং পাউমীদগী
149) should be May
@Shreeshrii I performed fine-tuning starting from script/Bengali.traineddata. It has improved the OCR accuracy, but there are still some random English words in ocr.txt.
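One way to quantify the stray English output is to scan the OCR text for runs of ASCII letters. The sketch below does that per line; the function names are hypothetical, and it flags every Latin-letter run, so genuine English words in the source page would be reported too.

```python
import re
from typing import List, Tuple

LATIN_RUN = re.compile(r"[A-Za-z]+")


def find_latin_tokens(text: str) -> List[str]:
    """Return every run of ASCII letters found in `text`, in order."""
    return LATIN_RUN.findall(text)


def latin_report(ocr_text: str) -> List[Tuple[int, List[str]]]:
    """Return (1-based line number, tokens) for each line with Latin letters."""
    report = []
    for number, line in enumerate(ocr_text.splitlines(), start=1):
        tokens = find_latin_tokens(line)
        if tokens:
            report.append((number, tokens))
    return report


if __name__ == "__main__":
    # Fragment of the OCR output quoted in this thread.
    sample = "মহাক্কী ISP Wey UN cals) Gros ওইরিবশিং"
    print(latin_report(sample))
```

Comparing the report for two traineddata files on the same page gives a quick count of how many spurious Latin tokens each one produces.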
@Sanaj2060 What is the ocr accuracy with your new finetuned traineddata?
Please also provide the correct ground truth text for your test image ref.tiff in zip file.
@Shreeshrii https://github.com/Sanaj2060/tesseract-ocr-Manipuri-Bengali/tree/master/evaluation%20report
man_2 has the highest accuracy, 93% as per the ocr-eval tool. Some characters like "য়" are counted as misclassified by the evaluation tool even when the model predicts them correctly, so the actual accuracy can be estimated at around 95%.
Thank you for sharing your results.
Some characters like "য়" are counted as misclassified by the evaluation tool even if the model predict it correctly
1632 0 {য়}-{য়}
1632 0 {\09DF }-{\09AF \09BC }
Tesseract is generating the normalized decomposed form for letters with nukta. You can change your ground truth correct text to use the decomposed form for all letters with nukta to get a more accurate report.
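A minimal sketch of that ground-truth conversion, assuming only the three precomposed Bengali nukta letters need rewriting (U+09DC, U+09DD, U+09DF). The explicit mapping follows the confusion line above (09DF versus 09AF 09BC); the function name decompose_nukta is hypothetical.

```python
# Precomposed Bengali nukta letters and their base letter + NUKTA (U+09BC) forms.
NUKTA_DECOMPOSITIONS = {
    "\u09DC": "\u09A1\u09BC",  # RRA  -> DDA  + NUKTA
    "\u09DD": "\u09A2\u09BC",  # RHA  -> DDHA + NUKTA
    "\u09DF": "\u09AF\u09BC",  # YYA  -> YA   + NUKTA
}

_TABLE = str.maketrans(NUKTA_DECOMPOSITIONS)


def decompose_nukta(text: str) -> str:
    """Replace precomposed nukta letters with their decomposed sequences."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    precomposed = "\u09DF"
    decomposed = decompose_nukta(precomposed)
    # Show the code points before and after conversion.
    print([hex(ord(c)) for c in precomposed])
    print([hex(ord(c)) for c in decomposed])
```

Applying this to correct.txt before running the evaluation should stop correctly recognized nukta letters from being counted as errors. Text that is already decomposed passes through unchanged.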
Input text (tiff format)
ত্রিটমেন্টকী মথৌ তারি ঙসি মহাক্কী চাইরুংদা মফম অনি থোক্না ফ্লেকচর ওইরিবশিং অদু শেমজিন্নবা মেজর ওপরেসন অমা পাঙথোকখ্রে শান্নরোইসিগী ফিভম সরকার অসিনা য়াম্না কুপ্না য়েংশিল্লি হায়রি চেরোল অসিনা মখা তারক্লিবদি, ইন্দিয়া লৈঙাক অমদি মণিপুর লৈঙাক্না মচেৎ-মকাই ওইবা মওংদা নত্রগা প্রেসর কনবা ফুরুপকী ন্বাফমদা য়ুম্ফম ওইরগা নাকল অমত্তগী ওইবা ন্বারোইশিন থিরবদি
Output.txt
ত্রিটমেন্টকী মঘৌ তারি ঙসি মহাক্কী ISP Wey UN cals) Gros ওইরিবশিং অদু শেমজিন্নবা মেজর ওপরেসন অমা পাঙথোবখ্রে শন্নরোইসিগী ফিভম সরকার অসিনা য়াম্না কুপ্পা য়েংশিল্লি হায়রি চেরোল অসিনা মখা তারক্লিবদি, ইন্দিয়া লৈঙাক অমদি মণিপুর লৈঙার্না মচেৎঘ-মকাই ওইবা মওংদা নত্রগা প্রেসর কনবা ফুরুপকী ন্বাফমদা যুদ্ষম ওইরগা নাকল অমত্তগী ওইবা ন্বারোইশিন থিরবদি