Open ghost opened 6 years ago
@zdenop, @amitdo, @Shreeshrii, could you please provide the tiff and box files used to train urd.traineddata? My email address is moen.eqbal@gmail.com.
I don't have those files.
I wonder if Ray used some Urdu fonts with the Nastaliq style for training.
@amitdo thank you for your kind response. Could you please mention Ray here?
@theraysmith sir, could you please provide the tiff and box files used to train urd.traineddata? My email address is moen.eqbal@gmail.com.
Urdu LSTM training text etc are available at https://github.com/tesseract-ocr/langdata_lstm/tree/master/urd
The fonts used for 3.04 are listed in https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh#L552
Ray has not shared the fontlist for 4.0.0 training yet.
Here is the image I used to perform OCR; the recognized text is very different from the original.
Please also provide the correct text (ground truth) for the image for testing.
@Shreeshrii thank you so much for your response. I am working to improve the accuracy of the model, and the box files would help me understand how boxes are drawn around the characters manually for training. I am also facing two issues when I try to train my custom model.
I asked for help in issue #1832, but @zdenop closed it because the question was a support request rather than an issue. So now I want to go through the training process followed by the Tesseract developers.
Urdu LSTM training text etc are available at https://github.com/tesseract-ocr/langdata_lstm/tree/master/urd
The link you provided does not contain the box files; it only has text files with the Urdu data used to train the model.
Please have a look at the following link; the Tesseract team has provided tiff/box pairs for some languages: https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-Make-Box-Files#tifbox-pairs-provided
Note: I am using Tesseract 4.
If you use tesstrain.sh, it will create the box/tif pairs correctly for RTL languages as well.
You can use --save_box_tiff with the command. Please build Tesseract using the latest code (beta.4) from GitHub.
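As a rough illustration of the suggestion above, the following Python sketch only assembles a tesstrain.sh command line with --save_box_tiff and prints it; it does not run anything. All directory paths and the font name are placeholders chosen for this example, and the other flags are the ones commonly passed to tesstrain.sh in Tesseract 4, so check them against the script in your own checkout before relying on them.

```python
import shlex
from typing import List, Sequence


def build_tesstrain_command(
    lang: str,
    langdata_dir: str,
    tessdata_dir: str,
    fonts_dir: str,
    output_dir: str,
    fontlist: Sequence[str] = (),
    script: str = "tesstrain.sh",
) -> List[str]:
    """Assemble (but do not run) a tesstrain.sh invocation that keeps box/tif pairs."""
    cmd = [
        script,
        "--lang", lang,
        "--linedata_only",               # produce LSTM line data
        "--noextract_font_properties",
        "--langdata_dir", langdata_dir,  # e.g. a checkout of langdata_lstm
        "--tessdata_dir", tessdata_dir,
        "--fonts_dir", fonts_dir,
        "--output_dir", output_dir,
        "--save_box_tiff",               # keep the generated box/tif pairs
    ]
    if fontlist:
        # --fontlist takes every following argument up to the next "--" option,
        # so it is placed last here.
        cmd.append("--fontlist")
        cmd.extend(fontlist)
    return cmd


# Placeholder paths and font name: adjust to your own setup.
command = build_tesstrain_command(
    lang="urd",
    langdata_dir="./langdata_lstm",
    tessdata_dir="./tessdata_best",
    fonts_dir="/usr/share/fonts",
    output_dir="./urd_train",
    fontlist=["Jameel Noori Nastaleeq"],
)
print(shlex.join(command))
```

The printed line can be pasted into a shell once the paths and font names match what is installed locally.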
@Shreeshrii, I've already trained my own model using my own tif/box files, but the results are worse than those of the original trained model.
This is the image:
and this is the result:
ﻧﺮگﻛریﯾﯾﺎﺘﯿﮯﻮزﻧﺎﻧﺎﺮگﻟاورﺻﻮﺑﺑﻮﺮﻛﻧﺘنﯿواہﯿںﻧﺮﻛﮏاﻧﺼﺎﻑ۔ﮯﻣﮏرانﺎر
ن۔ﯾﯾﯿﺎبﯿںﻧننﮏ ﺁﮯ ﺁﮯﮪ۔ ﺻﻮﺑﻄﺪﮪﯿں ﯿﮯﻟﺎری۔ﮯﻛﻣرﺎیﺣﺎگﺻﻞگی۔ ﺮاﯾیﯿںﺸﮏوگﻮناﻢﻛﺑاﻢﻛﺻﮪﺎﺎﮨﻮﻛگﺎ ﺎﻮﻮﮧﺎںﻣںﺎﻻﺣﻼرﻘﺎںﮪ۔ﺪﯾﺎگﯿ وﺻیﮨں ﻮںﻛوﺎﯿںاﻟوﻟﻮایﻣﯾرﻑگوﺎﮨﻧﻮیﻛوﺎیوﺎ۔ﻔﺮﺎنﯿںگوﺎﺰﯾﻛرﯿہوﺎﻛﮨﮧرﮪہﮯ ﻛﮧاﺻﻞﻣﯿاﺎﺎﺼﯾگﺮﮯﮏاﻧﺼﺎﻑاورﻧنﺸﯾﮏﯿںﮨﻮﻛاورﯾگﺮﮯﮏاﻧﺼﺎﻑﻛﺒﻟاﺑﺎریرﮪﺸﻛ۔
If you can share the tif/box files with me, which is not in the link you have provided, it will help me find the problem.
Any thoughts?
Thanks
P.S. Is it possible to contact you privately for help?
If you can share the tif/box files with me, which is not in the link you have provided, it will help me find the problem.
The box/tiff files have NOT been provided by Google. You can run the scripts on the training text to generate them.
and this is the result:
I asked for the ground truth, i.e. the correct text matching that image, for testing the various urd.traineddata files (tessdata, tessdata_best, tessdata_fast) and also Arabic.traineddata.
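One common way to use such ground truth is to compute a character error rate (CER) for each model's output. The sketch below is only an illustration of that idea, not the procedure used by the Tesseract developers: it normalizes both strings (NFC, collapsed whitespace) and divides the Levenshtein edit distance by the length of the ground truth. The function names are made up for this example.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Apply NFC normalization and collapse runs of whitespace to single spaces."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the normalized ground truth."""
    reference = normalize(ground_truth)
    hypothesis = normalize(ocr_output)
    if not reference:
        raise ValueError("ground truth must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Running `character_error_rate(ground_truth, output)` once per traineddata file on the same image gives one number per model, which makes the comparison less subjective than reading the outputs side by side.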
@Shreeshrii here is the correct text.
غیر سرکاری نتائج کے مطابق مرکز اور صوبہ خیبر پختون خواہ میں تحریک انصاف نے میدان مار لیا۔ پنجاب میں نون لیگ آگے آگے ہے۔ صوبہ سندھ میں پیپلز پارٹی نے کامیابی حاصل کی۔ کراچی میں ملک دشمن ایم کیو ایم کا صفایا ہو گیا۔ بلوچستان میں ملا جلار جحان ہے۔ یہ نتائج وہی ہیں جس کی بین الاقوامی میڈیا نے بھی پیشگوئی کی تھی۔ پاکستان میں بھی تجزیہ کار یہی کہہ رہے تھے کہ اصل مقابلہ تحریک انصاف اور نون لیگ میں ہو گا اور تحریک انصاف کا پلڑا بھاری رہے گا۔
Thanks! Do you know what font was used for the image?
@Shreeshrii the font used in this image is "Nastaliq", but I have 463 other pages to OCR. Those pages mix the "Nastaliq" and "majalla" font styles. Here is a sample of the pages I want to OCR.
following is the correct text of this image. عبدالکریم | اشرف حسین | 9-7220602-42000 | 22 | مکان نمبر 80 محله گلشن ضیاء گلی نمبر 1 لیاقت چوک اورنگی ٹاون، ضلع کراچی غربی
Please look at the following image. The font inside the red box is "majalla"; it can be ignored because the text in the red box is a copy of "عبدالکریم | اشرف حسین" written in the majalla font style.
Hello @Shreeshrii any solution or suggestion?
@Shreeshrii, please also let me know why my custom-trained model does not recognize the spaces between letters. Please check the following image (it shows the letters of the Urdu alphabet). The output for this image should be:
ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک ک ل م ن د ہ ھ ء ی ے
but my custom-trained Urdu model produces the following output (I trained and tested the model on the same image):
ابپتٹثجچح خدڈذرڑزژسشصض طظعغفقککل
من دہھءیے
The output is missing the whitespace between characters. If we add the whitespace manually, the output looks like this:
ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک ک ل م ن د ہ ھ ء ی ے
Even the default Urdu model included in Tesseract 4.0 produces the following output:
اب پت ٹ ث ىىغً دڈوڈرڑز مس شی صصضص اف یقکگلگل
و دو ءگی ١ے
The font used in the image is Nastaliq.
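To check whether an output like the one above differs from the expected text only in whitespace, the two strings can be compared after all whitespace is removed. This is a minimal sketch with made-up function names; it treats only what Python's `str.split()` considers whitespace as removable, so invisible joiner characters would still count as differences.

```python
def strip_whitespace(text: str) -> str:
    """Remove every whitespace character (spaces, tabs, newlines)."""
    return "".join(text.split())


def differs_only_in_whitespace(expected: str, actual: str) -> bool:
    """True when the strings are unequal but match once whitespace is removed."""
    return expected != actual and strip_whitespace(expected) == strip_whitespace(actual)


# Short sample in the spirit of the alphabet image discussed above.
print(differs_only_in_whitespace("ا ب پ", "ابپ"))
```

If this returns True for a model's output, the characters themselves were recognized and only the spacing was lost.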
@moeneqbal did you manage to train the model properly? I want to use Urdu OCR to detect the text on the back of a CNIC. Could you please help me by sending the solution to kafeelbutt45@gmail.com? It would really mean a lot.
@kafeelbutt no, Tesseract does not fully support the Urdu language; there are still many issues in the trained data. You can use Google OCR, which gives much better results.
Are Google OCR and Tesseract different?
Any luck with Urdu OCR training?
The accuracy of the default trained Urdu model is not good. Have a look at the OCR output of Tesseract's default Urdu model (urd.traineddata):
خی رص رکااری تنانم کے مطالقی م رکز اور صوبہ تخیرپچشتون خو اویٹش تح یک انصاف نے میید ان مار
لیا۔ :تاب میں ون لیک آ گے آکے سے۔ صصوبہ سندرھ میں ٦ زا ٹین ےکامیالی حاص لکی۔ 0-7 ۔ و چتتان می ملا جلارجحان ہے۔ یہ تا وی ٹیں ج کی ئن الالتوائی میڈیانے بھی پیک یک تھی۔پامتان ٹل بھی زکارم کب رس خے کیہ اصمل متاللہ تح کیک انصاف اور فون لیگ یل ہو گا اور تح کیک انصا کا بڑ ابعار یر ےگا۔
Here is the image I used to perform OCR; the recognized text is very different from the original.