Open Shreeshrii opened 7 years ago
@aidinkrmz
Are B Nazanin , B Roya unicode fonts? Please try OCR with the latest BEST traineddata.
@theraysmith
Does language-specific.sh have the current list of fonts used by your BEST training?
Are B Nazanin , B Roya unicode fonts? Please try OCR with the latest BEST traineddata.
Those names usually indicate a family of fonts. If you can train a new set of fonts for Persian, download these from this page, which provides OFL-licensed fonts that are heavily used in the Persian TeX community: XB Zar, XB Roya (equivalent to B Roya), XB Kayhan (roughly equivalent to B Nazanin).
Also, here is another FOSS-licensed Persian font bundle that contains both Roya and Nazanin (under the name Nazli). These have less glyph coverage but are somewhat more standards-compliant.
You can also use B Nazanin and B Roya themselves, but they are not released under a FOSS license, if that matters.
Please try OCR with the latest BEST traineddata.
Has LSTM-based Persian traineddata been released recently? How can we have a look?
@Shreeshrii Of course they are Unicode fonts, and more than 90% of texts use this font family, e.g. B Nazanin, B Yaghoot, B Zar.
@ebraminio Hello, apparently you are Iranian, so let us share our problem with you. The traineddata we currently have works correctly only with the Arial font; texts written in Iranian fonts such as B Nazanin or B Zar do not give correct results and do not work. What do you think the remedy is? Is there a solution?
I tested version 4 on the attached image. It has these problems:
1- It doesn't recognize ZWNJ.
2- It doesn't recognize ●.
3- It has problems with ligatures like لا.
4- The image's font is B Nazanin.
5- It doesn't recognize ، ؛ ؟ (\u060C \u061B \u061F).
6- Its dictionary is incomplete. I suggest using the Persian hunspell data; for example, it doesn't recognize (ساخت - ناشی). This data is used by Chrome (for more information look here).
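The characters in this report are easier to discuss by code point. Below is a minimal Python sketch for counting them in an OCR output string and listing their Unicode names. The helper names `count_special_chars` and `describe` are hypothetical and not part of Tesseract; the code only inspects text and makes no claim about how Tesseract handles these characters internally.

```python
import unicodedata

# Characters mentioned in the report above, keyed by code point.
CHARS_OF_INTEREST = {
    "\u200c": "ZWNJ",
    "\u25cf": "bullet",
    "\u060c": "Arabic comma",
    "\u061b": "Arabic semicolon",
    "\u061f": "Arabic question mark",
}


def count_special_chars(text):
    """Return how often each character of interest occurs in text."""
    return {label: text.count(ch) for ch, label in CHARS_OF_INTEREST.items()}


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]
```

Running `count_special_chars` on both the ground truth and the OCR output, then comparing the two dictionaries, gives a quick indication of which of these characters were dropped.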
Please see https://github.com/tesseract-ocr/tessdata/blob/master/best/fas.traineddata
for the 4.0alpha best model for Persian, uploaded by @theraysmith just a few days ago.
Your feedback will help him improve the next version of training for beta release.
ShreeDevi
@reza1615 Use VietOCR 5 alpha; you can get good results with it.
@aidinkrmz I also tested VietOCR 5 alpha. It has the same problem (you can test the attached image). VietOCR is only an interface.
Everyone who tests the new best traineddata should also update Tesseract to the latest commit.
Is this the latest version? tesseract-ocr-setup-4.0.0-alpha.20170804.exe. I used this version and downloaded the data from the installation wizard.
Yes, that is the latest version.
@reza1615 But I tested your picture without any errors!
@aidinkrmz for me it recognizes ساختمانها as ساختمانها or ● as @ or لا as ا like بالاتری as باالتری
@reza1615 Dear Mr. Mohammad Reza, a 2-3% error rate is tolerable; it cannot be expected to recognize everything perfectly, and that "sakhtoman ha" case is also part of that 2-3% of errors.
@aidinkrmz You said it doesn't have any errors! Now you say it has 2-3% errors! I didn't say tesseract-ocr has a fatal bug; I said it should handle these cases. If these bugs can be solved, why shouldn't we solve them in the program? Also, it doesn't recognize ZWNJ, and that isn't a minor bug.
@reza1615 We are talking about the traineddata file, and the traineddata file doesn't have any problem.
@reza1615
it doesn't recognize ZWNJ and it isn't a minor bug.
Please elaborate on that with some examples and any suggestions you have for fixing. Thanks!
@Shreeshrii In this image:
Yellow: doesn't recognize ZWNJ
Red: problem with the ligature لا
Green: problem with ●
Blue: confuses . (dot) and ۰ (Persian 0)
Purple: incorrect word
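Differences such as a missing ZWNJ or a dot in place of ۰ are invisible or easy to miss by eye. One way to list them, sketched here with Python's standard `difflib`, is a character-level diff between the expected text and the OCR output. The function name `char_diffs` is hypothetical, and the sketch assumes both texts are available as plain strings.

```python
import difflib


def char_diffs(expected, actual):
    """Return (operation, expected_span, actual_span) for each non-matching span."""
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    return [
        (tag, expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Printing each span with `repr()` or `ascii()` makes invisible characters such as `\u200c` show up explicitly in the report.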
I attached the output file to compare them
@Shreeshrii: I find the LSTM result quite convincing here. Keep up the good work :)
Question from Ray in https://github.com/tesseract-ocr/langdata/issues/72
Anyone know which digits are needed for the other Arabic languages? kur_ara, pus, uig
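For reference when answering this, Unicode has two separate digit ranges in the Arabic block: Arabic-Indic digits (U+0660 to U+0669) and Extended Arabic-Indic digits (U+06F0 to U+06F9, the forms used in Persian). The sketch below only reports which ranges occur in a sample text, so it can be run on a corpus for each language; it does not assert which range any particular language needs. The function name `digit_blocks_used` is hypothetical.

```python
ARABIC_INDIC = {chr(cp) for cp in range(0x0660, 0x066A)}
EXTENDED_ARABIC_INDIC = {chr(cp) for cp in range(0x06F0, 0x06FA)}


def digit_blocks_used(text):
    """Return the set of digit ranges that appear in text."""
    blocks = set()
    for ch in text:
        if ch in ARABIC_INDIC:
            blocks.add("arabic-indic")
        elif ch in EXTENDED_ARABIC_INDIC:
            blocks.add("extended-arabic-indic")
        elif "0" <= ch <= "9":
            blocks.add("ascii")
    return blocks
```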
Hi Shree, I tested the BEST fas.traineddata, but it has a problem: it can't recognize "لا". I fine-tuned for "لا" and still had this problem. As for xheight, I used Arabic.xheight for Persian, but after creating the unicharset this file had been cleared.
@theraysmith Any update regarding the new training for RTL languages?
Hi guys. I'm trying to find traineddata for Arabic numbers only. Can anybody guide me to where I can find it? Thank you.
I'm using Tesseract 4 with Visual Studio 2017 C++. I tried the normal ara.traineddata, and the results do not look right at all.
the results الرقم /1 ١ ١5 .//ا1١1؟؟ @theraysmith @Shreeshrii
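Until a digits-only model is available, one possible workaround is to post-process the recognized text. This is a sketch that assumes post-processing is acceptable for the use case; it is not a Tesseract feature, and the function name `digits_only` is hypothetical. It keeps only decimal digits and converts them to ASCII using the standard `unicodedata.decimal` lookup.

```python
import unicodedata


def digits_only(ocr_text):
    """Drop everything except decimal digits and map them to ASCII 0-9."""
    kept = []
    for ch in ocr_text:
        value = unicodedata.decimal(ch, None)  # None for non-digit characters
        if value is not None:
            kept.append(str(value))
    return "".join(kept)
```

This does not repair misrecognized characters (for example a digit read as a letter); it only filters and normalizes what was recognized as a digit.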
Hi. I tested the BEST fas.traineddata, but it had some errors. For example, it couldn't recognize the 'ی' character in some fonts. I integrated some specific fonts such as "B Nazanin", "B Zar", and "B Lotus" by fine-tuning the pre-trained model. After testing the new .traineddata, it recognized some fonts better than the BEST fas.traineddata, but it couldn't recognize ZWNJ, whereas with the BEST fas.traineddata I could recognize ZWNJ. Now my questions are:
1- What fonts did you use for training the BEST fas.traineddata?
2- How should I train tesseract-ocr 4.0 to recognize ZWNJ in Persian?
@AbdelsalamHaa
Did you try with script/Arabic.traineddata?
@farhodi What training text did you use for fine-tuning? Did it have any ZWNJ in it?
Take a look at the unicharset from your trained data and compare with the one in the repository.
Make sure you have all needed characters in your training text.
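The comparison suggested above can be partly automated. The following sketch assumes the usual unicharset text layout (first line is the entry count, and each following line starts with the symbol followed by a space); the function names are hypothetical. It works per character, so it is only an approximation when the unicharset contains multi-character entries.

```python
def parse_unicharset_symbols(unicharset_text):
    """Return the set of symbols (first field of each entry line)."""
    lines = unicharset_text.split("\n")
    # lines[0] is the entry count; skip empty trailing lines.
    return {line.split(" ", 1)[0] for line in lines[1:] if line}


def missing_from_unicharset(training_text, symbols):
    """Return training-text characters that have no unicharset entry."""
    return sorted(
        {ch for ch in training_text if not ch.isspace() and ch not in symbols}
    )
```

Any character reported by `missing_from_unicharset` cannot be produced by the model, so it is worth checking that list before starting a long training run.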
Regarding the font list and training done originally by Ray Smith, we are awaiting updates to langdata.
Hi, I tested the new version of Tesseract (4 beta) on Persian. The results are good, but there are some errors. For example:
1) Characters that share the same shape but differ in the position or number of dots are confused (e.g. بـ تـ ثـ یـ or ز ر ژ).
2) In some cases, when the same word occurs several times in a document, its results differ (e.g. the word "نویسه").
3) I think dictionary correction is not applied at the end of the process. Is that true?
4) How can we train more fonts?
thanks
@Shreeshrii I used the same training text from the langdata "fas" folder for fine-tuning and just added new fonts for training. Also, I couldn't find fas.unicharset in the langdata repository to compare with my .unicharset. The .unicharset I used is as follows:
108
NULL 0 NULL 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
و 1 0,68,137,238,65,290,0,27,62,256 Arabic 3 13 3 و # و [648 ]x
ه 1 55,123,147,255,35,181,6,64,48,222 Arabic 4 13 4 ه # ه [647 ]x
ک 1 47,121,200,255,131,288,0,45,124,305 Arabic 5 13 5 ک # ک [6a9 ]x
ن 1 0,88,163,255,68,321,0,52,76,354 Arabic 6 13 6 ن # ن [646 ]x
ی 1 0,71,148,225,95,253,0,45,103,279 Arabic 7 13 7 ی # ی [6cc ]x
ا 1 26,117,200,255,11,181,7,82,33,222 Arabic 8 13 8 ا # ا [627 ]x
خ 1 0,66,172,255,92,262,2,37,84,290 Arabic 9 13 9 خ # خ [62e ]x
س 1 0,64,140,228,123,493,0,50,132,523 Arabic 10 13 10 س # س [633 ]x
ع 1 0,64,148,255,98,239,2,37,81,276 Arabic 11 13 11 ع # ع [639 ]x
ض 1 0,64,174,255,131,619,0,50,132,654 Arabic 12 13 12 ض # ض [636 ]x
م 1 0,64,134,241,51,272,0,46,56,313 Arabic 13 13 13 م # م [645 ]x
ل 1 0,96,200,255,62,328,0,50,71,332 Arabic 14 13 14 ل # ل [644 ]x
ف 1 44,125,202,255,113,339,0,47,123,378 Arabic 15 13 15 ف # ف [641 ]x
ر 1 0,63,137,224,45,297,0,22,59,244 Arabic 16 13 16 ر # ر [631 ]x
پ 1 0,42,142,217,113,258,2,50,123,288 Arabic 17 13 17 پ # پ [67e ]x
د 1 49,123,163,250,43,467,0,70,59,503 Arabic 18 13 18 د # د [62f ]x
ت 1 58,123,170,255,113,339,2,50,123,378 Arabic 19 13 19 ت # ت [62a ]x
. 10 12,108,64,140,18,52,9,77,52,193 Common 20 6 20 . # . [2e ]p
ج 1 0,64,133,255,92,262,2,37,84,290 Arabic 21 13 21 ج # ج [62c ]x
ق 1 0,79,179,255,84,310,0,52,88,345 Arabic 22 13 22 ق # ق [642 ]x
ش 1 0,64,196,255,123,493,0,50,132,523 Arabic 23 13 23 ش # ش [634 ]x
ز 1 0,63,167,255,45,298,0,22,59,242 Arabic 24 13 24 ز # ز [632 ]x
: 10 12,108,157,255,18,58,11,77,52,193 Common 25 6 25 : # : [3a ]p
ب 1 0,71,140,224,113,339,0,50,123,378 Arabic 26 13 26 ب # ب [628 ]x
آ 1 26,117,230,255,36,161,0,58,33,198 Arabic 27 13 27 آ # آ [622 ]x
ي 1 0,56,148,255,95,431,0,45,103,467 Arabic 28 13 28 ي # ي [64a ]x
گ 1 47,125,208,255,131,289,0,45,132,305 Arabic 29 13 29 گ # گ [6af ]x
, 10 0,72,69,140,21,62,8,65,39,193 Common 30 6 30 , # , [2c ]p
غ 1 0,64,196,255,98,239,2,37,81,276 Arabic 31 13 31 غ # غ [63a ]x
ح 1 0,64,133,255,92,262,2,37,84,290 Arabic 32 13 32 ح # ح [62d ]x
= 0 86,150,160,244,90,218,3,33,99,262 Common 33 10 33 = # = [3d ]
} 10 0,67,210,255,37,125,4,46,54,193 Common 34 10 86 } # } [7d ]p
ك 1 49,123,203,255,91,451,0,50,103,483 Arabic 35 13 35 ك # ك [643 ]x
/ 10 12,102,224,255,43,166,0,29,54,193 Common 36 6 36 / # / [2f ]p
٧ 8 58,125,181,255,70,211,0,65,87,270 Common 37 5 37 ٧ # ٧ [667 ]0
٨ 8 58,123,179,255,70,235,0,65,88,270 Common 38 5 38 ٨ # ٨ [668 ]0
٣ 8 55,121,184,255,71,235,0,65,88,338 Common 39 5 39 ٣ # ٣ [663 ]0
١ 8 55,121,184,255,13,134,0,110,57,270 Common 40 5 40 ١ # ١ [661 ]0
٤ 8 60,121,183,255,46,238,0,62,58,270 Common 41 5 41 ٤ # ٤ [664 ]0
ژ 1 0,63,192,255,71,190,0,22,59,193 Arabic 42 13 42 ژ # ژ [698 ]x
چ 1 0,29,133,213,92,192,4,37,84,213 Arabic 43 13 43 چ # چ [686 ]x
ۀ 1 59,121,206,255,40,134,0,20,48,165 Arabic 44 13 44 ۀ # ۀ [6c0 ]x
… 10 12,102,64,124,114,273,8,37,132,333 Common 96 10 96 ... # … [2026 ]p
٬ 10 62,236,164,255,21,53,9,77,52,193 Arabic 97 5 97 ٬ # ٬ [66c ]p
\ 10 12,102,207,255,42,154,0,57,43,193 Common 98 10 98 \ # \ [5c ]p
" 10 139,254,204,255,42,128,9,55,64,193 Common 99 10 99 " # " [22 ]p
& 10 12,100,192,255,83,266,4,27,121,299 Common 100 10 100 & # & [26 ]p
٫ 10 15,98,97,167,33,103,6,61,52,193 Arabic 101 5 101 ٫ # ٫ [66b ]p
? 10 12,108,204,255,56,195,4,43,70,249 Common 102 10 102 ? # ? [3f ]p
< 0 47,109,188,255,49,218,0,40,78,262 Common 103 10 107 < # < [3c ]
_ 10 0,84,0,102,76,259,0,12,74,249 Common 104 10 104 _ # _ [5f ]p
| 0 0,88,207,255,6,64,12,82,31,193 Common 105 10 105 | # | [7c ]
٪ 10 33,105,213,255,79,205,0,41,101,294 Arabic 106 4 106 ٪ # ٪ [66a ]p
> 0 47,109,188,255,49,222,3,33,78,262 Common 107 10 103 > # > [3e ]
Thanks for your reply
combine_tessdata -u tessdata_best/fas.traineddata fas.
This will unpack the traineddata file.
Look at fas.lstm-unicharset
That probably has the ZWNJ in it.
You can add a few additional lines to the training text in langdata which have ZWNJ
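Once the traineddata is unpacked, both checks can be scripted. This is a sketch under the assumption that the unicharset file is plain text with the symbol as the first space-separated field of each line; the function names are hypothetical. The first function reports whether any entry contains ZWNJ (U+200C), and the second counts how many non-empty training-text lines contain it.

```python
ZWNJ = "\u200c"


def unicharset_has_zwnj(unicharset_text):
    """True if any unicharset entry's symbol contains ZWNJ."""
    lines = unicharset_text.split("\n")[1:]  # skip the entry-count line
    return any(ZWNJ in line.split(" ", 1)[0] for line in lines if line)


def zwnj_line_stats(training_text):
    """Return (lines containing ZWNJ, total non-empty lines)."""
    lines = [line for line in training_text.split("\n") if line.strip()]
    return sum(ZWNJ in line for line in lines), len(lines)
```

For example, after reading `fas.lstm-unicharset` and the training text with `open(path, encoding="utf-8").read()`, a result of `(0, N)` from `zwnj_line_stats` would indicate that the training text gives the model no ZWNJ examples to learn from.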
How do I install the fas traineddata on macOS? Can someone provide the necessary commands to run?
Ref: https://github.com/tesseract-ocr/langdata/pull/76#issuecomment-320425422
copied below