tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

"Could not find a matching blob" error while training Tesseract 4 on Urdu language data #1885

Open ghost opened 6 years ago

ghost commented 6 years ago

The accuracy of the default trained model for Urdu is not good. Have a look at the OCR output of the default Urdu model (urd.traineddata) from Tesseract:

خی رص رکااری تنانم کے مطالقی م رکز اور صوبہ تخیرپچشتون خو اویٹش تح یک انصاف نے میید ان مار

لیا۔ :تاب میں ون لیک آ گے آکے سے۔ صصوبہ سندرھ میں ٦ زا ٹین ےکامیالی حاص لکی۔ 0-7 ۔ و چتتان می ملا جلارجحان ہے۔ یہ تا وی ٹیں ج کی ئن الالتوائی میڈیانے بھی پیک یک تھی۔پامتان ٹل بھی زکارم کب رس خے کیہ اصمل متاللہ تح کیک انصاف اور فون لیگ یل ہو گا اور تح کیک انصا کا بڑ ابعار یر ےگا۔

Here is the image I used for OCR; the recognized text is very different from the original.

[image attachment]

ghost commented 6 years ago

@zdenop, @amitdo, @Shreeshrii can you please provide the tiff and box files used to train urd.traineddata? My email address is moen.eqbal@gmail.com.

amitdo commented 6 years ago

I don't have those files.

I wonder if Ray used some Urdu fonts with the Nastaliq style for training.

ghost commented 6 years ago

@amitdo thank you for your kind response. Can you please mention Ray here?

ghost commented 6 years ago

@theraysmith sir, can you please provide the tiff and box files used to train urd.traineddata? My email address is moen.eqbal@gmail.com.

Shreeshrii commented 6 years ago

The Urdu LSTM training text and related files are available at https://github.com/tesseract-ocr/langdata_lstm/tree/master/urd

The fonts used for 3.04 are listed in https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh#L552

Ray has not shared the fontlist for 4.0.0 training yet.


Shreeshrii commented 6 years ago

> here is the image which I used to perform OCR, text is much different from original.

Please also provide the correct text (ground truth) for the image for testing.

ghost commented 6 years ago

@Shreeshrii thank you so much for your response. I am working on improving the accuracy of the model, and a box file would help me understand how to create boxes around the characters manually for training. I am also facing two issues when I try to train my custom model:

  1. The model does not recognize the spaces between words.
  2. The model outputs the text in LTR order (Urdu is an RTL language, similar to Arabic).

I asked for help in issue #1832, but @zdenop closed it because it was a support request rather than a bug report, so now I want to follow the same training process the Tesseract developers used.

> Urdu LSTM training text etc are available at https://github.com/tesseract-ocr/langdata_lstm/tree/master/urd

The link you provided does not contain a box file; it only has text files with the Urdu data used to train the model.

Please have a look at the following link; the Tesseract team has provided tiff/box pairs for some languages: https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-Make-Box-Files#tifbox-pairs-provided

Note: I am using Tesseract 4.

Shreeshrii commented 6 years ago

If you use tesstrain.sh it will create the box/tif pairs correctly for RTL languages also.

You can add --save_box_tiff to the command. Please build Tesseract from the latest code (beta.4) on GitHub.
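For illustration only, the command described above might be assembled like this in Python. The flag names are the ones I believe tesstrain.sh accepted in Tesseract 4 (please check src/training/tesstrain.sh in your own checkout); the helper name, every path, and the font name are placeholders and not values from this thread.

```python
import shlex


def build_tesstrain_command(lang, langdata_dir, tessdata_dir, fonts_dir,
                            output_dir, fontlist=None):
    """Return an argv list for tesstrain.sh that also keeps the box/tiff pairs.

    Assumption: the flags below match the tesstrain.sh shipped with Tesseract 4.
    """
    cmd = [
        "src/training/tesstrain.sh",
        "--lang", lang,
        "--linedata_only",                # produce LSTM line data
        "--noextract_font_properties",
        "--langdata_dir", langdata_dir,
        "--tessdata_dir", tessdata_dir,
        "--fonts_dir", fonts_dir,
        "--output_dir", output_dir,
        "--save_box_tiff",                # keep the generated box/tiff pairs
    ]
    if fontlist:
        # Each font name is passed as its own argument after --fontlist.
        cmd += ["--fontlist", *fontlist]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths and font name, shown only to illustrate the shape.
    example = build_tesstrain_command(
        lang="urd",
        langdata_dir="../langdata_lstm",
        tessdata_dir="./tessdata",
        fonts_dir="/usr/share/fonts",
        output_dir="./urd_train",
        fontlist=["Noto Nastaliq Urdu"],
    )
    print(shlex.join(example))
```

The list can be passed to subprocess.run(cmd, check=True) from the root of the Tesseract source tree; the generated box/tiff pairs should then be kept alongside the other training output.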

ghost commented 6 years ago

@Shreeshrii , I've already trained my own model using my own tif/box file but the result is worse than the original trained model.

This is the image: [image attachment]

and this is the result:

ﻧﺮگﻛریﯾﯾﺎﺘﯿﮯﻮزﻧﺎﻧﺎﺮگﻟاورﺻﻮﺑﺑﻮﺮﻛﻧﺘنﯿواہﯿںﻧﺮﻛﮏاﻧﺼﺎﻑ۔ﮯﻣﮏرانﺎر

ن۔ﯾﯾﯿﺎبﯿںﻧننﮏ ﺁﮯ ﺁﮯﮪ۔ ﺻﻮﺑﻄﺪﮪﯿں ﯿﮯﻟﺎری۔ﮯﻛﻣرﺎیﺣﺎگﺻﻞگی۔ ﺮاﯾیﯿںﺸﮏوگﻮناﻢﻛﺑاﻢﻛﺻﮪﺎﺎﮨﻮﻛگﺎ ﺎﻮﻮﮧﺎںﻣںﺎﻻﺣﻼرﻘﺎںﮪ۔ﺪﯾﺎگﯿ وﺻیﮨں ﻮںﻛوﺎﯿںاﻟوﻟﻮایﻣﯾرﻑگوﺎﮨﻧﻮیﻛوﺎیوﺎ۔ﻔﺮﺎنﯿںگوﺎﺰﯾﻛرﯿہوﺎﻛﮨﮧرﮪہﮯ ﻛﮧاﺻﻞﻣﯿاﺎﺎﺼﯾگﺮﮯﮏاﻧﺼﺎﻑاورﻧنﺸﯾﮏﯿںﮨﻮﻛاورﯾگﺮﮯﮏاﻧﺼﺎﻑﻛﺒﻟاﺑﺎریرﮪﺸﻛ۔

If you can share the tif/box files with me, which are not at the link you provided, it will help me find the problem.

Any thoughts?

Thanks

P.S. Is it possible to contact you privately for help?
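One thing that may be worth checking in the result pasted above: many of its characters appear to be Arabic presentation forms (the positional glyph code points in the U+FB50 to U+FDFF and U+FE70 to U+FEFC ranges) rather than the base letters used in normal Urdu text. If the box files were written with such code points, the model would learn to emit them. The following is a minimal sketch, assuming that diagnosis is right; the function names are my own, and NFKC folding is a generic Unicode technique, not something the Tesseract training scripts are known to do.

```python
import unicodedata

# Arabic Presentation Forms-A and Forms-B (U+FEFF, the BOM, is excluded).
PRESENTATION_RANGES = ((0xFB50, 0xFDFF), (0xFE70, 0xFEFC))


def is_presentation_form(ch):
    """True if ch lies in one of the Arabic presentation-form ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRESENTATION_RANGES)


def presentation_form_ratio(text):
    """Fraction of non-whitespace characters that are presentation forms."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_presentation_form(ch) for ch in chars) / len(chars)


def fold_presentation_forms(text):
    """Map presentation forms to base letters via NFKC normalization."""
    return unicodedata.normalize("NFKC", text)
```

Running presentation_form_ratio over the box-file text and over the OCR output shows quickly whether positional glyphs have leaked into the training data. Note that NFKC also folds other compatibility characters, so it is better suited to inspection than to silently rewriting ground truth.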

Shreeshrii commented 6 years ago

> If you can share the tif/box files with me, which is not in the link you have provided, it will help me find the problem.

The box/tiff files have NOT been provided by Google. You can run the scripts on the training text to generate them.

Shreeshrii commented 6 years ago

> and this is the result:

I asked for the ground truth, i.e. the correct text matching that image, so that the various urd.traineddata files (tessdata, tessdata_best, tessdata_fast) and also Arabic.traineddata can be tested against it.
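With a ground-truth string in hand, each traineddata file can be compared by character error rate. The sketch below is a plain-Python illustration of that comparison, assuming the OCR output and the ground truth are already available as strings; it is not a tool from the Tesseract repository, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance divided by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)
```

A lower value is better; the value can exceed 1.0 when the OCR output is much longer than the ground truth.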

ghost commented 6 years ago

@Shreeshrii here is the correct text.

غیر سرکاری نتائج کے مطابق مرکز اور صوبہ خیبر پختون خواہ میں تحریک انصاف نے میدان مار لیا۔ پنجاب میں نون لیگ آگے آگے ہے۔ صوبہ سندھ میں پیپلز پارٹی نے کامیابی حاصل کی۔ کراچی میں ملک دشمن ایم کیو ایم کا صفایا ہو گیا۔ بلوچستان میں ملا جلار جحان ہے۔ یہ نتائج وہی ہیں جس کی بین الاقوامی میڈیا نے بھی پیشگوئی کی تھی۔ پاکستان میں بھی تجزیہ کار یہی کہہ رہے تھے کہ اصل مقابلہ تحریک انصاف اور نون لیگ میں ہو گا اور تحریک انصاف کا پلڑا بھاری رہے گا۔

Shreeshrii commented 6 years ago

Thanks! Do you know what font was used for the image?


ghost commented 6 years ago

@Shreeshrii the font used in this image is "Nastaliq", but I have another 463 pages to OCR, and those pages are a mix of the "Nastaliq" and "majalla" font styles. Here is a sample of the pages I want to OCR.

[image attachment: sample]

The following is the correct text for this image: عبدالکریم | اشرف حسین | 9-7220602-42000 | 22 | مکان نمبر 80 محله گلشن ضیاء گلی نمبر 1 لیاقت چوک اورنگی ٹاون، ضلع کراچی غربی

Please look at the following image. The font inside the red box is "majalla"; it can be ignored because the text in the red box is a copy of " عبدالکریم | اشرف حسین " written in the majalla font style.

[image attachment]

ghost commented 6 years ago

Hello @Shreeshrii any solution or suggestion?

ghost commented 6 years ago

@Shreeshrii please also let me know why the custom trained model is not recognizing the spaces between letters. Please check the following image (these are the letters of the Urdu alphabet).

[image attachment]

The output for this image should look like this:

ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک ک ل م ن د ہ ھ ء ی ے

But the custom trained model for Urdu produces the following output (I trained and tested the model on the same image):

ابپتٹثجچح خدڈذرڑزژسشصض طظعغفقککل

من دہھءیے

The output is missing the whitespace between characters; if we add the whitespace manually, the output looks like this:

ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک ک ل م ن د ہ ھ ء ی ے

Even the default model for the Urdu language included in Tesseract 4.0 produces the following output:

اب پت ٹ ث ىىغً دڈوڈرڑز مس شی صصضص اف یقکگلگل

و دو ءگی ١ے

The font used in the image is Nastaliq.
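To separate the spacing problem from genuine misrecognition, the two strings can be compared with all whitespace removed. This is only a small diagnostic sketch in Python with hypothetical function names; it says nothing about why the model drops the spaces.

```python
def strip_whitespace(text):
    """Remove every whitespace character, including line breaks."""
    return "".join(text.split())


def differs_only_in_whitespace(ocr_text, ground_truth):
    """True when the strings differ, but only in where whitespace occurs."""
    return (ocr_text != ground_truth
            and strip_whitespace(ocr_text) == strip_whitespace(ground_truth))


if __name__ == "__main__":
    expected = "ا ب پ"   # short excerpt of the expected alphabet line
    got = "ابپ"          # the same letters with the spaces lost
    print(differs_only_in_whitespace(got, expected))
```

If the check returns True for the custom model's output, the letters themselves are being recognized and only the space handling in training or recognition needs attention.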

stweil commented 5 years ago

Meanwhile the font list used for training was added to the repository.

kafeelbutt commented 3 years ago

@moeneqbal did you manage to train the model properly? I want to use Urdu OCR to detect the text on the back side of a CNIC. Could you please help me by sending the solution to kafeelbutt45@gmail.com? It would really mean a lot.

ghost commented 3 years ago

@kafeelbutt no, Tesseract does not fully support the Urdu language; there are still many issues in the trained data. You can use Google OCR, which gives much better results.

kafeelbutt commented 3 years ago

Are Google OCR and Tesseract different?


precise-sajid commented 3 years ago

Any luck with Urdu OCR training?