tesseract-ocr / tessdata

Trained models with fast variant of the "best" LSTM models + legacy models
Apache License 2.0

Best Traineddata Feedback - Persian #70

Open Shreeshrii opened 7 years ago

Shreeshrii commented 7 years ago

Ref: https://github.com/tesseract-ocr/langdata/pull/76#issuecomment-320425422

copied below

Hello, I'm a software engineering student and I use the Tesseract OCR engine in a university project. For Persian, the traineddata file produced by training Tesseract 4.00 with the LSTM method gives good results for Arial fonts, but it does not give good results for some specific Persian fonts. So my questions are:
1- Did you use specific fonts such as B Nazanin, B Roya, etc. when training Tesseract 4.00 with LSTM?
2- If they were not used, how can we use these fonts to get better results?
I prepared a text in which every form of each letter is repeated 10, 15, or more times. I also ran the Tesseract 3.05 training process on this text, but I did not get better or useful output. To achieve good results for Persian in the Tesseract OCR engine, we need your experience and your help. Thanks for your attention. Sincerely.

Shreeshrii commented 7 years ago

@aidinkrmz

Are B Nazanin , B Roya unicode fonts? Please try OCR with the latest BEST traineddata.

Shreeshrii commented 7 years ago

Also see https://github.com/tesseract-ocr/tessdata/issues/3

https://github.com/tesseract-ocr/langdata/issues/26

Shreeshrii commented 7 years ago

@theraysmith

Does language-specific.sh have the current list of fonts used by your BEST training?

ebraminio commented 7 years ago

Are B Nazanin , B Roya unicode fonts? Please try OCR with the latest BEST traineddata.

Those names usually indicate a family of fonts. If you can train on a new set of fonts for Persian, download these from this page, which provides OFL-licensed fonts that are used heavily in the Persian TeX community: XB Zar, XB Roya (equivalent to B Roya), XB Kayhan (roughly equivalent to B Nazanin).

Also, here is another FOSS-licensed Persian font bundle that contains both Roya and Nazanin (under the name Nazli). These have less glyph coverage but are somewhat more standards-compliant.

You can also use B Nazanin and B Roya themselves, but they are not released under a FOSS license, if that matters.

Please try OCR with the latest BEST traineddata.

Has LSTM-based Persian traineddata been released recently? How can we have a look?

aidinkrmz commented 7 years ago

@Shreeshrii Of course they are Unicode fonts, and more than 90% of texts use fonts from this family, such as B Nazanin, B Yaghoot, and B Zar.

aidinkrmz commented 7 years ago

@ebraminio Hello, it seems you are Iranian, so let us share our problem with you. The traineddata we currently have works correctly only with the Arial font; texts written in Iranian fonts such as B Nazanin or B Zar do not give correct results. What do you think the remedy is? Is there a solution?

reza1615 commented 7 years ago

I tested version 4 on the attached image. It has these problems:
1- It doesn't recognize ZWNJ (https://en.wikipedia.org/wiki/Zero-width_non-joiner).
2- It doesn't recognize ●.
3- It has problems with ligatures like لا (https://en.wikipedia.org/wiki/Arabic_alphabet#Ligatures).
4- The image's font is B_nazanin.
5- It doesn't recognize ، ؛ ؟ (\u060C \u061B \u061F).
6- Its dictionary is incomplete; for example, it doesn't recognize (ساخت - ناشی). I suggest using Persian Hunspell's data, which is used by Chrome (for more information look here).
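A quick way to turn the list above into a repeatable check is to compare a ground-truth transcription with the OCR output and report which of the mentioned characters disappeared. This is only a sketch: the sample strings and the helper names `describe` and `lost_characters` are made up for illustration, and the character list is limited to the code points named in this comment plus ZWNJ (U+200C) and ● (U+25CF).

```python
import unicodedata

# Characters the comment above reports as unrecognized:
# ZWNJ, black circle, Arabic comma, Arabic semicolon, Arabic question mark.
REPORTED = ["\u200c", "\u25cf", "\u060c", "\u061b", "\u061f"]


def describe(ch):
    """Return a readable label such as 'U+060C ARABIC COMMA'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def lost_characters(ground_truth, ocr_output):
    """List reported characters present in the ground truth but absent from the OCR output."""
    return [ch for ch in REPORTED if ch in ground_truth and ch not in ocr_output]


if __name__ == "__main__":
    # Hypothetical sample strings, not taken from the attached image.
    truth = "ساختمان\u200cها\u060c \u25cf \u061f"
    ocr = "ساختمانها @"
    for ch in lost_characters(truth, ocr):
        print(describe(ch))
```

The check only says whether a character vanished entirely; it does not locate individual misrecognitions.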

Shreeshrii commented 7 years ago

Please see https://github.com/tesseract-ocr/tessdata/blob/master/best/fas.traineddata

for the 4.0alpha best model for persian, uploaded by @theraysmith just a few days back.

Your feedback will help him improve the next version of training for beta release.


aidinkrmz commented 7 years ago

@reza1615 Use VietOCR 5 alpha; you can get good results with it.

reza1615 commented 7 years ago

@aidinkrmz I also tested VietOCR 5 alpha. It has the same problem (you can test the attached image). VietOCR is only an interface.

amitdo commented 7 years ago

Everyone who tests the new best traineddata should also update Tesseract to the latest commit.

reza1615 commented 7 years ago

Is this the latest version? tesseract-ocr-setup-4.0.0-alpha.20170804.exe. I used this version and downloaded the data from the installation wizard.

amitdo commented 7 years ago

Yes, that is the latest version.

aidinkrmz commented 7 years ago

@reza1615 But I tested your picture and got no errors!

reza1615 commented 7 years ago

@aidinkrmz for me it recognizes ساختمان‌ها as ساختمانها or ● as @ or لا as ا like بالاتری as باالتری
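The first case above is easy to miss by eye because ZWNJ is invisible. A minimal sketch for confirming that two strings differ only by ZWNJ (U+200C) follows; the function name is hypothetical and the strings are written with an explicit `\u200c` escape so the difference is visible in the source.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def differs_only_by_zwnj(expected, recognized):
    """True when the strings are unequal but become equal once ZWNJ is removed."""
    if expected == recognized:
        return False
    return expected.replace(ZWNJ, "") == recognized.replace(ZWNJ, "")


expected = "ساختمان" + ZWNJ + "ها"   # correct spelling, with ZWNJ
recognized = "ساختمانها"             # OCR output, ZWNJ dropped

print(differs_only_by_zwnj(expected, recognized))
print(len(expected) - len(recognized))  # number of code points lost
```

The ligature case (بالاتری read as باالتری) is a different kind of error: the letters are reordered, so removing ZWNJ does not make those two strings equal.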

aidinkrmz commented 7 years ago

@reza1615 Dear Mohammad Reza, a 2-3% error rate is acceptable; it is not supposed to recognize everything perfectly, and that "sakhtoman ha" case is part of the 2-3% error too.

reza1615 commented 7 years ago

@aidinkrmz You said it doesn't have any errors! Now you say it has 2-3% errors! I didn't say tesseract-ocr has a fatal bug. I said it should handle these cases; when these bugs can be solved, why shouldn't we solve them in the program? Also, it doesn't recognize ZWNJ and it isn't a minor bug.

aidinkrmz commented 7 years ago

@reza1615 We are talking about the traineddata file, and the traineddata file doesn't have any problem.

Shreeshrii commented 7 years ago

@reza1615

it doesn't recognize ZWNJ and it isn't a minor bug.

Please elaborate on that with some examples and any suggestions you have for fixing. Thanks!

reza1615 commented 7 years ago

@Shreeshrii In this image:
Yellow: doesn't recognize ZWNJ
Red: problem with the ligature لا
Green: problem with ●
Blue: confuses . (dot) and ۰ (Persian 0)
Purple: incorrect word

I attached the output file for comparison.

reza1615 commented 7 years ago

In my opinion, @theraysmith trained the data on texts that don't contain ZWNJ (U+200C). You can use fa.wikipedia Featured articles, which use correct Persian orthography; there are many texts here and here. You can collect the articles from here.

ebraminio commented 7 years ago

@Shreeshrii: I find the LSTM results very convincing here. Keep up the good work :)

Shreeshrii commented 7 years ago

Question from Ray in https://github.com/tesseract-ocr/langdata/issues/72

Anyone know which digits are needed for the other Arabic languages? kur_ara, pus, uig

khosrobeygizohre commented 7 years ago

Hi Shree, I tested the BEST fas.traineddata, but it has a problem: it can't recognize "لا". I fine-tuned for "لا" and still had this problem. As for xheight, I used Arabic.xheight for Persian, but after creating the unicharset, this file was emptied.

Shreeshrii commented 7 years ago

@theraysmith Any update regarding the new training for RTL languages?

AbdelsalamHaa commented 6 years ago

Hi guys. I'm trying to find traineddata for Arabic numbers only. Can anybody guide me on where to find it? Thank you.

I'm using Tesseract 4 with Visual Studio 2017 C++. I tried the normal ara.traineddata, but the results don't look right at all.

the results الرقم /1 ١ ١5 .//ا1١1؟؟ @theraysmith @Shreeshrii

zeinabfarhoudi commented 6 years ago

Hi. I tested the BEST fas.traineddata, but it had some errors; for example, it couldn't recognize the 'ی' character in some fonts. I integrated some specific fonts such as "B Nazanin", "B Zar", and "B Lotus" by fine-tuning the pre-trained model. After testing, the new .traineddata could recognize some fonts better than the BEST fas.traineddata, but it couldn't recognize ZWNJ, whereas the BEST fas.traineddata could. Now my questions are:
1- What fonts did you use for training the BEST fas.traineddata?
2- How should I train tesseract-ocr 4.0 to recognize ZWNJ in Persian?

Shreeshrii commented 6 years ago

@AbdelsalamHaa

Did you try with script/Arabic.traineddata?

Shreeshrii commented 6 years ago

@farhodi What training text did you use for fine-tuning? Did it have any ZWNJ in it?

Take a look at the unicharset from your trained data and compare with the one in the repository.

Make sure you have all needed characters in your training text.
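The comparison suggested above can be scripted. The sketch below assumes the usual unicharset layout, where the first line is the entry count and every following line starts with the unichar followed by space-separated properties; the function names and the file paths in the trailing comment are placeholders, not part of any Tesseract tool.

```python
def read_unichars(lines):
    """Collect the first space-delimited field of each entry line.

    Assumes line 0 holds the entry count and is skipped.
    """
    return {ln.split(" ", 1)[0] for ln in lines[1:] if ln.strip()}


def compare_unicharsets(mine, reference):
    """Return the entries that appear on one side only, as sorted lists."""
    return {
        "only_in_mine": sorted(mine - reference),
        "only_in_reference": sorted(reference - mine),
    }


def load(path):
    """Read a unicharset file as a list of lines (UTF-8)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()


# Example with placeholder paths:
#   diff = compare_unicharsets(read_unichars(load("my/fas.unicharset")),
#                              read_unichars(load("reference/fas.lstm-unicharset")))
#   for ch in diff["only_in_reference"]:
#       print(" ".join(f"U+{ord(c):04X}" for c in ch))
```

Printing code points rather than the characters themselves helps with invisible entries such as ZWNJ (U+200C).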

Regarding the font list and training done originally by Ray Smith, we are awaiting updates to langdata.

reza6966 commented 6 years ago

Hi, I tested the new version of Tesseract (4 beta) on Persian. The results are good, but there are some errors. For example:
1) Characters that have the same shape but differ in the position or number of dots are confused (e.g. بـ تـ ثـ یـ or ز ر ژ).
2) In some cases, when the same word occurs several times in a document, it is recognized differently each time (e.g. word="نویسه").
3) I think dictionary correction is not applied at the end of the process; is that true?
4) How can we train more fonts?

thanks

zeinabfarhoudi commented 6 years ago

@Shreeshrii I used the same training text from the langdata "fas" folder for fine-tuning and just added new fonts for training. Also, I couldn't find fas.unicharset in the langdata repository to compare with my .unicharset. The .unicharset I used is as follows:

108
NULL 0 NULL 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
و 1 0,68,137,238,65,290,0,27,62,256 Arabic 3 13 3 و # و [648 ]x
ه 1 55,123,147,255,35,181,6,64,48,222 Arabic 4 13 4 ه # ه [647 ]x
ک 1 47,121,200,255,131,288,0,45,124,305 Arabic 5 13 5 ک # ک [6a9 ]x
ن 1 0,88,163,255,68,321,0,52,76,354 Arabic 6 13 6 ن # ن [646 ]x
ی 1 0,71,148,225,95,253,0,45,103,279 Arabic 7 13 7 ی # ی [6cc ]x
ا 1 26,117,200,255,11,181,7,82,33,222 Arabic 8 13 8 ا # ا [627 ]x
خ 1 0,66,172,255,92,262,2,37,84,290 Arabic 9 13 9 خ # خ [62e ]x
س 1 0,64,140,228,123,493,0,50,132,523 Arabic 10 13 10 س # س [633 ]x
ع 1 0,64,148,255,98,239,2,37,81,276 Arabic 11 13 11 ع # ع [639 ]x
ض 1 0,64,174,255,131,619,0,50,132,654 Arabic 12 13 12 ض # ض [636 ]x
م 1 0,64,134,241,51,272,0,46,56,313 Arabic 13 13 13 م # م [645 ]x
ل 1 0,96,200,255,62,328,0,50,71,332 Arabic 14 13 14 ل # ل [644 ]x
ف 1 44,125,202,255,113,339,0,47,123,378 Arabic 15 13 15 ف # ف [641 ]x
ر 1 0,63,137,224,45,297,0,22,59,244 Arabic 16 13 16 ر # ر [631 ]x
پ 1 0,42,142,217,113,258,2,50,123,288 Arabic 17 13 17 پ # پ [67e ]x
د 1 49,123,163,250,43,467,0,70,59,503 Arabic 18 13 18 د # د [62f ]x
ت 1 58,123,170,255,113,339,2,50,123,378 Arabic 19 13 19 ت # ت [62a ]x
. 10 12,108,64,140,18,52,9,77,52,193 Common 20 6 20 . # . [2e ]p
ج 1 0,64,133,255,92,262,2,37,84,290 Arabic 21 13 21 ج # ج [62c ]x
ق 1 0,79,179,255,84,310,0,52,88,345 Arabic 22 13 22 ق # ق [642 ]x
ش 1 0,64,196,255,123,493,0,50,132,523 Arabic 23 13 23 ش # ش [634 ]x
ز 1 0,63,167,255,45,298,0,22,59,242 Arabic 24 13 24 ز # ز [632 ]x
: 10 12,108,157,255,18,58,11,77,52,193 Common 25 6 25 : # : [3a ]p
ب 1 0,71,140,224,113,339,0,50,123,378 Arabic 26 13 26 ب # ب [628 ]x
آ 1 26,117,230,255,36,161,0,58,33,198 Arabic 27 13 27 آ # آ [622 ]x
ي 1 0,56,148,255,95,431,0,45,103,467 Arabic 28 13 28 ي # ي [64a ]x
گ 1 47,125,208,255,131,289,0,45,132,305 Arabic 29 13 29 گ # گ [6af ]x
, 10 0,72,69,140,21,62,8,65,39,193 Common 30 6 30 , # , [2c ]p
غ 1 0,64,196,255,98,239,2,37,81,276 Arabic 31 13 31 غ # غ [63a ]x
ح 1 0,64,133,255,92,262,2,37,84,290 Arabic 32 13 32 ح # ح [62d ]x
= 0 86,150,160,244,90,218,3,33,99,262 Common 33 10 33 = # = [3d ]
} 10 0,67,210,255,37,125,4,46,54,193 Common 34 10 86 } # } [7d ]p
ك 1 49,123,203,255,91,451,0,50,103,483 Arabic 35 13 35 ك # ك [643 ]x
/ 10 12,102,224,255,43,166,0,29,54,193 Common 36 6 36 / # / [2f ]p
٧ 8 58,125,181,255,70,211,0,65,87,270 Common 37 5 37 ٧ # ٧ [667 ]0
٨ 8 58,123,179,255,70,235,0,65,88,270 Common 38 5 38 ٨ # ٨ [668 ]0
٣ 8 55,121,184,255,71,235,0,65,88,338 Common 39 5 39 ٣ # ٣ [663 ]0
١ 8 55,121,184,255,13,134,0,110,57,270 Common 40 5 40 ١ # ١ [661 ]0
٤ 8 60,121,183,255,46,238,0,62,58,270 Common 41 5 41 ٤ # ٤ [664 ]0
ژ 1 0,63,192,255,71,190,0,22,59,193 Arabic 42 13 42 ژ # ژ [698 ]x
چ 1 0,29,133,213,92,192,4,37,84,213 Arabic 43 13 43 چ # چ [686 ]x
ۀ 1 59,121,206,255,40,134,0,20,48,165 Arabic 44 13 44 ۀ # ۀ [6c0 ]x

Thanks for your reply

Shreeshrii commented 6 years ago

combine_tessdata -u tessdata_best/fas.traineddata fas.

This will unpack the traineddata file.

Look at fas.lstm-unicharset

That probably has the ZWNJ in it.

You can add a few additional lines to the training text in langdata which have ZWNJ
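Both suggestions can be checked with a few lines of Python once the traineddata has been unpacked. This is a sketch under assumptions: it treats the unicharset as a count line followed by one entry per line with the unichar as the first field, and the helper names and the `limit` parameter are invented for illustration. The source of extra sentences (for example a Persian corpus you already have) is up to you.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def unicharset_has_zwnj(lines):
    """True if any entry's first field contains ZWNJ (line 0 is the count)."""
    return any(ZWNJ in ln.split(" ", 1)[0] for ln in lines[1:])


def lines_with_zwnj(corpus_lines, limit=50):
    """Pick up to `limit` stripped lines that contain at least one ZWNJ."""
    picked = []
    for line in corpus_lines:
        if ZWNJ in line:
            picked.append(line.strip())
            if len(picked) >= limit:
                break
    return picked


# Example with placeholder file names:
#   with open("fas.lstm-unicharset", encoding="utf-8") as f:
#       print(unicharset_has_zwnj(f.read().splitlines()))
#   with open("my_corpus.txt", encoding="utf-8") as f:
#       extra = lines_with_zwnj(f, limit=20)
```

The selected lines could then be appended to a copy of the training text before regenerating the training data.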

NightMachinery commented 2 years ago

How do I install the fas traineddata on macOS? Can someone provide the necessary commands to run?

amitdo commented 2 years ago

@NightMachinery,

Please use our forum for asking questions.