tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Random English Words in Bengali output file #2516

Open Sanaj2060 opened 5 years ago

Sanaj2060 commented 5 years ago

Input text (tiff format):

ত্রিটমেন্টকী মথৌ তারি ঙসি মহাক্কী চাইরুংদা মফম অনি থোক্না ফ্লেকচর ওইরিবশিং অদু শেমজিন্নবা মেজর ওপরেসন অমা পাঙথোকখ্রে শান্নরোইসিগী ফিভম সরকার অসিনা য়াম্না কুপ্না য়েংশিল্লি হায়রি চেরোল অসিনা মখা তারক্লিবদি, ইন্দিয়া লৈঙাক অমদি মণিপুর লৈঙাক্না মচেৎ-মকাই ওইবা মওংদা নত্রগা প্রেসর কনবা ফুরুপকী ন্বাফমদা য়ুম্ফম ওইরগা নাকল অমত্তগী ওইবা ন্বারোইশিন থিরবদি

Output.txt:

ত্রিটমেন্টকী মঘৌ তারি ঙসি মহাক্কী ISP Wey UN cals) Gros ওইরিবশিং অদু শেমজিন্নবা মেজর ওপরেসন অমা পাঙথোবখ্রে শন্নরোইসিগী ফিভম সরকার অসিনা য়াম্না কুপ্পা য়েংশিল্লি হায়রি চেরোল অসিনা মখা তারক্লিবদি, ইন্দিয়া লৈঙাক অমদি মণিপুর লৈঙার্না মচেৎঘ-মকাই ওইবা মওংদা নত্রগা প্রেসর কনবা ফুরুপকী ন্বাফমদা যুদ্ষম ওইরগা নাকল অমত্তগী ওইবা ন্বারোইশিন থিরবদি

Why are there random English words in the output?

Shreeshrii commented 5 years ago

Please provide a copy of the input tiff. Which traineddata file did you use? Which psm?

The Bengali model is effectively ben+eng, but it should still not recognize Bangla as English.

Sanaj2060 commented 5 years ago

@Shreeshrii I'm using a fine-tuned traineddata, and psm is the default. files.zip

Shreeshrii commented 5 years ago

What is the result if you use the official traineddata from tessdata_best? Please try with both ben and script/Bengali.

Sanaj2060 commented 5 years ago

Using tessdata_best/ben, there are still some English words. Using tessdata_best/script/Bengali, there seem to be no random English words.

Shreeshrii commented 5 years ago

I'm using a fine tuned traineddata

How did you fine-tune? Which traineddata did you start from? What was the error rate at the end of training?

Sanaj2060 commented 5 years ago

I added some Manipuri words to ben.training_text (tesstrain.sh with the Lohit Bengali and Bangla Medium fonts) and started from tessdata_best/ben.traineddata. After 100k iterations the error rate was 0.006.

Shreeshrii commented 5 years ago

Using tessdata_best/ben, there are still some English words.

So, the problem exists in the model you started with.

Please try your training using script/Bengali and see if that is any better.

Does your unicharset include English letters?

Shreeshrii commented 5 years ago

The config file is loading English as a sub-language.

tessedit_load_sublangs eng

Please remove that.
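A minimal sketch of that edit, assuming the config component has already been extracted from the traineddata to a text file (for example with combine_tessdata -u, and written back afterwards with combine_tessdata -o). The helper name strip_sublangs and the second config variable in the sample are hypothetical; only the tessedit_load_sublangs line comes from this thread.

```python
def strip_sublangs(config_text: str) -> str:
    """Return the config text without any tessedit_load_sublangs line."""
    kept = []
    for line in config_text.splitlines():
        # Config lines have the form "<variable> <value>"; match on the variable name.
        fields = line.split()
        if fields and fields[0] == "tessedit_load_sublangs":
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical config contents; "some_other_variable" is a placeholder.
example = "tessedit_load_sublangs eng\nsome_other_variable 1\n"
print(strip_sublangs(example), end="")
```

All other lines are kept unchanged, so any remaining settings in the config still apply.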

Sanaj2060 commented 5 years ago

If we remove eng as a sub-language, will it still recognise the English words in the input file? And regarding fine-tuning from script/Bengali: there is a letter ৰ which is not used in the Manipuri language. How can I remove or ignore the letter ৰ in the model?

Shreeshrii commented 5 years ago

If we remove eng as a sub-language, will it still recognise the English words in the input file?

No, it won't recognize English.

You can try using ben+eng if you need English.

There is a letter ৰ which is not used in the Manipuri language. How can I remove or ignore the letter ৰ in the model?

Don't include it in your training text. Make sure it is not there in the unicharset.
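Both checks can be scripted. The sketch below assumes the unicharset has been extracted as text, with a count on the first line and one entry per following line that starts with the character itself. The helper names and the sample entries are hypothetical, and the property columns in the sample are simplified placeholders rather than real unicharset fields.

```python
UNWANTED = "\u09f0"  # ৰ, BENGALI LETTER RA WITH MIDDLE DIAGONAL


def unicharset_contains(unicharset_lines, char):
    """Return True if any unicharset entry (after the count line) is exactly `char`."""
    for line in unicharset_lines[1:]:
        # The first space-separated field of each entry is the character string.
        if line.split(" ")[0] == char:
            return True
    return False


def drop_lines_with(training_lines, char):
    """Return the training-text lines that do not contain `char`."""
    return [line for line in training_lines if char not in line]


# Hypothetical, simplified entries for illustration only.
sample = ["3", "NULL 0 Common 0", "\u0995 1 Bengali 1", "\u09f0 1 Bengali 2"]
print(unicharset_contains(sample, UNWANTED))
```

If the character is still reported after the training text has been cleaned, it is probably inherited from the unicharset of the model used as the starting point.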

Shreeshrii commented 5 years ago

I'm using a fine tuned traineddata. And psm is default.

Try with --psm 6 to see if there is any difference.
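To compare models and page segmentation modes side by side, the command lines can be generated rather than typed by hand. This is only a sketch: the image name, the model names and the tessdata directory are placeholders taken from or modelled on this thread, and the function only builds the commands, it does not run them.

```python
import shlex


def tesseract_command(image, out_base, lang, psm, tessdata_dir):
    """Build a tesseract invocation using the -l, --psm and --tessdata-dir options."""
    return [
        "tesseract", image, out_base,
        "-l", lang,
        "--psm", str(psm),
        "--tessdata-dir", tessdata_dir,
    ]


# Placeholder inputs; adjust paths and model names to your own setup.
for lang in ("ben_man_1", "script/Bengali"):
    for psm in (3, 6):
        out_base = "out_{}_psm{}".format(lang.replace("/", "_"), psm)
        print(shlex.join(tesseract_command("ref.tiff", out_base, lang, psm, "./")))
```

Each printed line can be pasted into a shell, or the list can be passed to subprocess.run if tesseract is installed.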

Shreeshrii commented 5 years ago

ref-man_1.txt

No English words without the sub language.

 tesseract ref.tiff ref-man_1 -l ben_man_1 --tessdata-dir ./
Error opening data file ./eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract Open Source OCR Engine v5.0.0-alpha-197-g65221 with Leptonica
Page 1
Page 2
Page 3
Page 4
Page 5
Page 6
Page 7
Page 8
Page 9
Page 10

Sanaj2060 commented 5 years ago

Accuracy measured with ocrevalutf8 accuracy correct.txt ocr.txt:

  1. ben_fine_tuned.traineddata accuracy 92.36% (--psm 3)
  2. tessdata_best/script/Bengali.traineddata accuracy 92.41% (--psm 3)
  3. ben_fine_tuned.traineddata accuracy 92.76% (--psm 6)
  4. tessdata_best/script/Bengali.traineddata accuracy 92.83% (--psm 6)

Sanaj2060 commented 5 years ago

In ref-man_1.txt, line 597:

চাদা ৭০গী চাংদা ভোট থাদনখি ভােট মশিং থিবগী থবক মে ১৫দা পাঙথোক্কদবনি 13 149) 201] 8 থৌদাং পাউমীদগী

Here "149)" should be "May".

Sanaj2060 commented 5 years ago

@Shreeshrii I fine-tuned again, this time starting from script/Bengali.traineddata. It improved the OCR accuracy, but there are still some random English words in ocr.txt.

Shreeshrii commented 5 years ago

@Sanaj2060 What is the OCR accuracy with your new fine-tuned traineddata?

Please also provide the correct ground truth text for your test image ref.tiff in a zip file.

Sanaj2060 commented 5 years ago

@Shreeshrii https://github.com/Sanaj2060/tesseract-ocr-Manipuri-Bengali/tree/master/evaluation%20report

man_2 has the highest accuracy, 93% according to the ocr-eval tool. Some characters, such as "য়", are counted as misclassified by the evaluation tool even when the model predicts them correctly, so the true accuracy is probably closer to 95%.

Shreeshrii commented 5 years ago

Thank you for sharing your results.

Some characters like "য়" are counted as misclassified by the evaluation tool even if the model predicts them correctly

    1632        0   {য়}-{য়}

   1632        0   {\09DF }-{\09AF \09BC }

Tesseract is generating the normalized decomposed form for letters with nukta. You can change your ground truth correct text to use the decomposed form for all letters with nukta to get a more accurate report.
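One way to make that change is to rewrite the ground truth with a small script. The sketch below maps the three precomposed Bengali nukta letters to base letter plus nukta (U+09BC), which are their canonical Unicode decompositions. The function name is hypothetical, and it is an assumption that these three letters cover everything in your text.

```python
# Precomposed Bengali letters and their decomposed (base + nukta U+09BC) forms.
NUKTA_DECOMPOSITIONS = {
    "\u09dc": "\u09a1\u09bc",  # RRA -> DDA + nukta
    "\u09dd": "\u09a2\u09bc",  # RHA -> DDHA + nukta
    "\u09df": "\u09af\u09bc",  # YYA -> YA + nukta
}
_TABLE = str.maketrans(NUKTA_DECOMPOSITIONS)


def decompose_nukta(text: str) -> str:
    """Replace precomposed nukta letters with their decomposed form."""
    return text.translate(_TABLE)


# Shows the code points of the decomposed form of U+09DF.
print([hex(ord(c)) for c in decompose_nukta("\u09df")])
```

Text that is already decomposed passes through unchanged, so the script can be run over the ground truth more than once without harm.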