tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Would like to help for Burmese/Myanmar language training? #13

Open herzcthu opened 9 years ago

herzcthu commented 9 years ago

Hello, I would like to help. I've already cloned the whole repository. How do I start?

zdenop commented 9 years ago

What issue is there with Burmese/Myanmar language?

herzcthu commented 9 years ago

We have two kinds of fonts: standard Unicode fonts and non-standard ones. When I check the langdata files for Burmese, most words are incorrect. I guess the data was generated from a mix of non-standard and standard Unicode content. When I try to scan an image with Burmese characters written in the Padauk font, the output is not readable. I would like to know the method you used to generate the Burmese training files. Where did you get the original data? I can check whether it is standard Unicode content or not.

minthanthtoo commented 8 years ago

I think the real issue is not only about using standard or non-standard Unicode, but also the wrong method of extracting data from the source. I mean the source data needs to be segmented correctly to get correct single words. Myanmar language users do not care much about adding a 'space' character between words; if you assume that all characters between two 'space' characters form a word, two or more words are falsely perceived as a single word. I found that most word lists here, especially the bi-grams, hold Myanmar phrases that are too long. That makes the wordlists unusable, and the results of applying them are totally unpredictable. So I think you need to extract data from a source using a dictionary-lookup approach. Of course, you need to build your own wordlist manually or use those made by others. Also, Myanmar is a syllable-based language; that is, one or more Myanmar letters combine to form a syllable, and one or more syllables join to form a word. So it is advisable to detect syllables, which should greatly improve the performance of dictionary lookup.
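The dictionary-lookup idea described above can be sketched as a greedy longest-match segmenter. This is only an illustration under stated assumptions: the function name `segment` and the tiny lexicon are hypothetical, and a real system would likely work on syllables rather than single code points when no lexicon entry matches.

```python
def segment(text, lexicon, max_len=None):
    """Greedy longest-match segmentation of unspaced text against a word list."""
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon has it.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            # No lexicon entry starts here: emit one code point as an unknown token.
            words.append(text[i])
            i += 1
    return words


# Hypothetical two-word lexicon, written with escapes to keep code points explicit.
MYANMAR = "\u1019\u103c\u1014\u103a\u1019\u102c"  # "Myanmar"
SA = "\u1005\u102c"                               # "writing"
print(segment(MYANMAR + SA, {MYANMAR, SA}))
```

Greedy matching is the simplest choice; it can pick a wrong split when a long entry hides two shorter words, which is why the quality of the word list matters so much here.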

Shreeshrii commented 7 years ago

@herzcthu @minthanthtoo

Please add some good sources of standard unicode fonts and sample texts and word frequency lists to https://github.com/tesseract-ocr/langdata/issues/46

herzcthu commented 7 years ago

https://my.wikipedia.org/ All contents on wikipedia are in standard unicode font.

nengine commented 7 years ago

@zdenop The issue is with the training data itself. The person who prepared the data does not know the Myanmar language. The majority of the training data has misspellings and is mixed with a hacked version of Myanmar Unicode, as said by @herzcthu. You can imagine rice and spaghetti mixed in a bowl. Also, it is not segmented properly, as @minthanthtoo pointed out. Any suggestions on how to prepare training data?

Shreeshrii commented 7 years ago

Please see Ray's comment at https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951

about how the training data is being built for the 4.0 LSTM training. I don't think they are using the training_text file in langdata.

nengine commented 7 years ago

Thanks @Shreeshrii. Would ./tesstrain.sh automatically create .tif/box pairs from the langdata directory for 4.0 LSTM training?

Shreeshrii commented 7 years ago

Yes. Tesstrain.sh creates tiff box pairs that can be used for LSTM training. Please see wiki pages regarding details. You need large amount of training data for good training. See Ray's comments about LSTM training process.

Shreeshrii commented 7 years ago

https://github.com/tesseract-ocr/tesseract/issues/654

will add the code to the github repo in due course, so experts/native speakers can offer suggestions/fixes to make them better. Myanmar in particular needs improvement, as the www data is littered with dotted circles, and the unicode book does not adequately describe the syntax for a well-formed grapheme in Myanmar (or any other language for that matter).

Shreeshrii commented 7 years ago

copied from https://github.com/tesseract-ocr/langdata/issues/46

@herzcthu commented

Myanmar wordlists https://github.com/kanaung/wordlists


https://github.com/kanyawtech/myanmar-karen-word-lists/blob/master/burmese-word-list.txt?raw=true

Is this a good wordlist in standard Unicode for Myanmar?

nengine commented 7 years ago

These are the most common words in Myanmar, but it is not a complete list. The definition of a word itself is tricky in Myanmar language because there are many ways syllables can be combined to form a word. I am not so sure how Tesseract training works, but it may be better to train on the syllables instead of a word(cluster of syllables). Which is also to say that each syllable must be first detected and then do the classification. Classifying entire word may be too difficult, unless I am not fully aware of Tesseract capabilities.

amitdo commented 7 years ago

Manually Constructed Context-Free Grammar For Myanmar Syllable Structure http://www.aclweb.org/anthology/E12-3004

amitdo commented 7 years ago

Representing Myanmar in Unicode Details and Examples http://unicode.org/notes/tn11/myanmar_uni-v2.pdf http://www.tuninst.net/LINGUISTICS/myanmar-unicode/myanmar-unicode.htm

Creating and Supporting OpenType Fonts for Myanmar Script https://www.microsoft.com/typography/OpenTypeDev/myanmar/intro.htm

Myanmar script notes http://rishida.net/scripts/myanmar/#shaping

https://www.researchgate.net/publication/253745697_A_Rule-based_Syllable_Segmentation_of_Myanmar_Text

Shreeshrii commented 7 years ago

@theraysmith

I used a few words from the Burmese wordlist and the landing page of Wikipedia as a small training sample to test Myanmar. Both of these are supposed to be in standard Unicode for Myanmar.

The training text and generated unicharset are attached. I got a number of errors while building the unicharset. Maybe the Myanmar.unicharset in langdata needs to be updated?


=== Phase UP: Generating unicharset and unichar properties files ===
[Fri Mar 31 16:07:02 DST 2017] /usr/local/bin/unicharset_extractor -D /tmp/tmp.OzCvDLSWBp/mya/ /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text_Bold.exp0.box /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text.exp0.box
Extracting unicharset from /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text_Bold.exp0.box
Extracting unicharset from /tmp/tmp.OzCvDLSWBp/mya/mya.Myanmar_Text.exp0.box
Wrote unicharset file /tmp/tmp.OzCvDLSWBp/mya//unicharset.
[Fri Mar 31 16:07:05 DST 2017] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset -O /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset -X /tmp/tmp.OzCvDLSWBp/mya/mya.xheights --script_dir=../langdata
Loaded unicharset of size 217 from file /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset
Setting unichar properties
Other case È of è is not in unicharset
Other case Ë of ë is not in unicharset
Warning: properties incomplete for index 4 = ယ်
Warning: properties incomplete for index 5 = လ်
Warning: properties incomplete for index 8 = မ်
Warning: properties incomplete for index 10 = င်
Warning: properties incomplete for index 16 = မှ
Warning: properties incomplete for index 22 = ရှ
Warning: properties incomplete for index 28 = ဖွဲ့
Warning: properties incomplete for index 30 = ည်
Warning: properties incomplete for index 36 = ပ်
Warning: properties incomplete for index 37 = ဖြ
Warning: properties incomplete for index 38 = င့်
Warning: properties incomplete for index 41 = က်
Warning: properties incomplete for index 42 = နှာ
Warning: properties incomplete for index 43 = ည်း
Warning: properties incomplete for index 44 = တ်
Warning: properties incomplete for index 45 = မှု
Warning: properties incomplete for index 47 = မ်း
Warning: properties incomplete for index 50 = ခြ
Warning: properties incomplete for index 51 = င်း
Warning: properties incomplete for index 52 = ကြော
Warning: properties incomplete for index 53 = နှို
Warning: properties incomplete for index 54 = ချွ
Warning: properties incomplete for index 63 = ပွဲ
Warning: properties incomplete for index 64 = တွေ
Warning: properties incomplete for index 65 = မှာ
Warning: properties incomplete for index 66 = ဆွေး
Warning: properties incomplete for index 67 = နွေး
Warning: properties incomplete for index 73 = ထွေ
Warning: properties incomplete for index 78 = မြ
Warning: properties incomplete for index 79 = စ်
Warning: properties incomplete for index 80 = မြို့
Warning: properties incomplete for index 83 = န်
Warning: properties incomplete for index 86 = ကွ
Warning: properties incomplete for index 89 = သွ
Warning: properties incomplete for index 92 = ဖ်
Warning: properties incomplete for index 96 = ခြေ
Warning: properties incomplete for index 100 = မျှ
Warning: properties incomplete for index 101 = ဂြို
Warning: properties incomplete for index 102 = ဟ်
Warning: properties incomplete for index 103 = တွ
Warning: properties incomplete for index 110 = ရှု
Warning: properties incomplete for index 119 = ညွှ
Warning: properties incomplete for index 120 = န်း
Warning: properties incomplete for index 123 = ကြ
Warning: properties incomplete for index 124 = ည့်
Warning: properties incomplete for index 125 = နှ
Warning: properties incomplete for index 126 = ထွ
Warning: properties incomplete for index 130 = ရှိ
Warning: properties incomplete for index 132 = ကြို
Warning: properties incomplete for index 140 = ဉ်
Warning: properties incomplete for index 150 = လှ
Warning: properties incomplete for index 151 = သွား
Warning: properties incomplete for index 153 = ထွာ
Warning: properties incomplete for index 154 = ထွား
Warning: properties incomplete for index 157 = ဖွံ့
Warning: properties incomplete for index 158 = မွ
Warning: properties incomplete for index 159 = လျော်
Warning: properties incomplete for index 162 = ပြော
Warning: properties incomplete for index 163 = ထွေး
Warning: properties incomplete for index 164 = ယှ
Warning: properties incomplete for index 168 = ဘွား
Warning: properties incomplete for index 179 = လွ
Warning: properties incomplete for index 182 = န့်
Warning: properties incomplete for index 189 = စွဲ
Warning: properties incomplete for index 192 = ပြီး
Warning: properties incomplete for index 197 = မြေ
Warning: properties incomplete for index 202 = ကွာ
Warning: properties incomplete for index 210 = ရှာ
Warning: properties incomplete for index 211 = ဖွေ
Warning: properties incomplete for index 212 = တွေ့
Warning: properties incomplete for index 214 = ပြ
Warning: properties incomplete for index 215 = ကြာ
Writing unicharset to file /tmp/tmp.OzCvDLSWBp/mya/mya.unicharset

mya.Myanmar_Text.exp0.txt mya.Myanmar_Text_Bold.exp0.txt mya.unicharset.txt

Shreeshrii commented 7 years ago

@herzcthu @nengine @minthanthtoo

Please take a look at https://github.com/tesseract-ocr/langdata/blob/master/Myanmar.unicharset in light of the above warning messages. Do you notice any pattern for the errors?

Tesseract does train on syllables (for Indic languages) AFAIK. Please see https://github.com/tesseract-ocr/langdata/files/885327/mya.unicharset.txt generated from the two training files - all listed in the message above.

Shreeshrii commented 7 years ago

@theraysmith do zwj and zwnj also have to be part of unicharset?

also see http://archive.mmgeeks.com/index.php?p=/discussion/379/zwnj-and-zwj

amitdo commented 7 years ago

https://github.com/khzaw/awesome-myanmar-unicode

Shreeshrii commented 7 years ago

Syllabification, Normalization and Lexicographic Ordering of Myanmar Texts using Formal Approaches

http://ir.nagaokaut.ac.jp/dspace/bitstream/10649/729/1/k709.pdf

nengine commented 7 years ago

I do not see consistent pattern.

  1. Warning: properties incomplete for index 4 = ယ် . ယ် by itself does not have any meaning, but when it is combined with ဘ which becomes ဘယ် it makes sense.

  2. Warning: properties incomplete for index 16 = မှ . မှ by itself does make sense and has a meaning, but not so sure why it is giving a warning.

Myanmar.unicharset clearly does not include these syllables shown in the warnings, but just consonants, vowels, etc.

Is Myanmar.unicharset supposed to include all syllable combinations? How does it work for Telugu, for example?
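One way to look for a pattern in the warnings is to list the code points of each flagged entry. The sketch below assumes the log lines have exactly the form shown earlier in this thread ("Warning: properties incomplete for index N = X"); the helper name `parse_warnings` is hypothetical. In the log above, the flagged entries all appear to be multi-code-point combinations rather than single characters.

```python
import re

# Matches the warning lines as they appear in the log pasted in this thread.
WARNING = re.compile(r"^Warning: properties incomplete for index (\d+) = (\S+)")


def parse_warnings(log_text):
    """Return (index, unichar, code points) for each 'properties incomplete' line."""
    rows = []
    for line in log_text.splitlines():
        m = WARNING.match(line.strip())
        if m:
            unichar = m.group(2)
            rows.append((int(m.group(1)), unichar, [f"{ord(c):04X}" for c in unichar]))
    return rows


# Two lines from the log, written with escapes (indexes 4 and 16).
sample = (
    "Setting unichar properties\n"
    "Warning: properties incomplete for index 4 = \u101a\u103a\n"
    "Warning: properties incomplete for index 16 = \u1019\u103e\n"
)
for index, unichar, cps in parse_warnings(sample):
    print(index, unichar, " ".join(cps), "multi" if len(cps) > 1 else "single")
```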

Shreeshrii commented 7 years ago

I don't think it is supposed to include all syllable combinations in Myanmar.unicharset but it should have all vowels, consonants, vowel signs.

I see three ranges for Myanmar. The first seems to be in the unicharset, along with part of the second and none of the third.

Can you please check whether all of these are required?

http://www.alanwood.net/unicode/myanmar.html

http://www.alanwood.net/unicode/myanmar-extended-a.html

http://www.alanwood.net/unicode/myanmar-extended-b.html

nengine commented 7 years ago

There are 8 major ethnic groups in Myanmar, so I believe extended A and B are added for that reason. So, for completeness I think it should be added, but you would rarely see them on the web. Unicode range 1000 - 104F is already good.
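To see how much of a sample text falls in each of the three ranges, a small counter like the following may help. It is a sketch: the block boundaries are the Unicode block ranges (Myanmar U+1000 to U+109F, Extended-A U+AA60 to U+AA7F, Extended-B U+A9E0 to U+A9FF), while the names `block_of`, `in_core_range` and `block_counts` are made up for this example.

```python
from collections import Counter

MYANMAR_BLOCKS = {
    "Myanmar": (0x1000, 0x109F),
    "Myanmar Extended-A": (0xAA60, 0xAA7F),
    "Myanmar Extended-B": (0xA9E0, 0xA9FF),
}


def block_of(ch):
    """Return the Myanmar block name for a character, or None if outside all three."""
    cp = ord(ch)
    for name, (lo, hi) in MYANMAR_BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return None


def in_core_range(ch):
    """True for U+1000..U+104F, the range described above as 'already good'."""
    return 0x1000 <= ord(ch) <= 0x104F


def block_counts(text):
    """Count characters per Myanmar block, ignoring everything else."""
    return Counter(b for b in map(block_of, text) if b is not None)


print(block_counts("\u1019\u103c\u1014\u103a\u1019\u102c \uaa60 abc"))
```

Running this over a candidate corpus gives a rough idea of whether the extended ranges occur often enough to be worth training on.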

Shreeshrii commented 7 years ago

you would rarely see them on the web.

What about in books / documents that need to be OCRed?

nengine commented 7 years ago

Yes, extended A and B should also be added for completeness, as I said, but as far as training samples go, they are almost nonexistent on the web.



theraysmith commented 7 years ago

Please take a look at this reference: http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf Table 16-3. The text says "Characters occur in the relative order shown in Table 16-3" which I do not believe to be completely correct. Part of the problem is that a lot of the characters are not even in this table! Although it is possible to guess which group the extensions belong to, I'm not convinced I have it correct. I have some code that implements this table plus my guesses to add the extensions, but it isn't ready for committing to github just yet.

The problem is that I need to exclude the incorrectly formatted text (that uses the non-standard fonts), but be sure that no correctly formatted text is dropped.
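As a rough illustration of how non-standard text might be flagged, here is a single-rule heuristic. It is not the filter described above, only a sketch under one assumption: in standard Unicode the vowel sign E (U+1031) is stored after its consonant and any medials, while text typed for non-standard fonts tends to store it in visual order, before the consonant. The function names are hypothetical, and the rule only covers Burmese consonants (U+1000 to U+1021, U+103F), so it could misjudge text in other languages of the Myanmar block.

```python
VOWEL_E = "\u1031"


def _can_precede_e(ch):
    """True if ch may directly precede U+1031 in standard storage order (Burmese only)."""
    cp = ord(ch)
    return (
        0x1000 <= cp <= 0x1021      # consonants, including stacked ones after U+1039
        or cp == 0x103F             # great sa
        or 0x103B <= cp <= 0x103E   # medials ya, ra, wa, ha
    )


def looks_like_visual_order(text):
    """Flag text where U+1031 appears without a consonant or medial before it."""
    prev = None
    for ch in text:
        if ch == VOWEL_E and (prev is None or not _can_precede_e(prev)):
            return True
        prev = ch
    return False


print(looks_like_visual_order("\u1000\u1031"))  # consonant then vowel sign E
print(looks_like_visual_order("\u1031\u1000"))  # vowel sign E stored first
```

A single rule like this will miss many non-standard lines that happen not to contain U+1031, so it could only serve as one signal among several.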


herzcthu commented 7 years ago

I've checked characters in Myanmar.unicharset file. All characters seem correct.

Shreeshrii commented 6 years ago

Please see https://github.com/tesseract-ocr/tesseract/issues/995#issuecomment-315133403

When I have committed the new corpus cleanup code, it would be useful to have any experts in any of the following scripts review the code and make comments: Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Myanmar, Khmer. There are script-specific cleanup rules in there. Since I plan to commit new copies of the training data (unicharsets, wordlists, training text etc) then at that point they will match

For instance, there is a big table in the unicode standard for Myanmar, ( http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf) but it doesn't cover any of the extension Myanmar characters, and isn't explicit about whether the table represents a specific valid order or not. The existence of a lot of legacy Myanmar text on the web that is designed for non-compliant fonts doesn't help make it easier to determine whether the filter is correct.

theraysmith commented 6 years ago

Please see code at: https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_myanmar.cpp


Shreeshrii commented 6 years ago

@herzcthu @nengine @minthanthtoo

Please test with the new traineddata in tessdata/best directory and provide feedback.

herzcthu commented 6 years ago

I'm testing new traineddata. It has improved a lot. Almost 98% correct. I will test more in detail and will provide feedback in detail later.

nengine commented 6 years ago

I would like to test it, but I am not sure how to do it. I have Windows 10 installed. Could you please point me to the documentation?

Shreeshrii commented 6 years ago

You can use new windows binaries for 4.0 linked from https://github.com/UB-Mannheim/tesseract/wiki


herzcthu commented 6 years ago

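[image: both] https://user-images.githubusercontent.com/3231665/29173007-d4bc484e-7e07-11e7-9036-0462da3ac580.png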

I've attached the first screenshot from my tests. The upper part is the image I tested and the lower part is the OCR output text. Words between two adjacent points of the same color are missing or incorrect.
If you need a code point comparison between the source image text and the output text, I can provide it later.
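A code point comparison of this kind can be produced with Python's standard `difflib`. This is a minimal sketch; the function names `codepoints` and `diff_codepoints` are hypothetical, and the expected text has to be typed by hand from the source image.

```python
import difflib


def codepoints(s):
    """Render a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]


def diff_codepoints(expected, actual):
    """Return the non-matching spans between two strings, as code point lists."""
    a, b = codepoints(expected), codepoints(actual)
    ops = difflib.SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes()
    return [(tag, a[i1:i2], b[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


# Ground truth typed by hand versus a hypothetical OCR output.
expected = "\u1019\u103c\u1014\u103a\u1019\u102c"
actual = "\u1019\u103c\u1014\u1019\u102c"
for tag, want, got in diff_codepoints(expected, actual):
    print(tag, want, got)
```

Comparing code points rather than rendered glyphs helps with Myanmar text, because two strings can look alike on screen while being stored differently.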

Shreeshrii commented 6 years ago

It would be helpful if you can point out to any pattern that you notice in the errors.

I think one that I notice is that words are getting dropped in the OCRed text (missing).


kyawswa commented 6 years ago

Hello, I would like to share what I found in Myanmar training data.

I used tesseract version 3.04.

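I think it still needs a lot of improvement. First, I would like to share my test results. I tested with images in the Myanmar language: ocr_sample_1.png and ocr_sample_2.png.

The test result for the ocr_sample_1 image is below. I marked the differences with red points. image_file [image: ocr_sample_1] https://user-images.githubusercontent.com/4832700/31503684-54c914c6-af96-11e7-8cb5-ebfbe85dc1c4.jpg

Result [image: screenshot from ocr_sample_1] https://user-images.githubusercontent.com/4832700/31503452-b8ce91c2-af95-11e7-96fa-256a19394daf.png

The result for the second image, ocr_sample_2, is below. Its result is completely wrong. The text means "how are you" in English.

Image_file [image: ocr_sample_2] https://user-images.githubusercontent.com/4832700/31503742-7748882e-af96-11e7-82e1-d4189ec553d0.png

Result [image: screenshot from ocr_sample_2] https://user-images.githubusercontent.com/4832700/31503479-c62a7e26-af95-11e7-93d5-442ddb1fd637.png

Then I downloaded the Myanmar langdata from GitHub (https://github.com/tesseract-ocr/langdata). I found 7 files. After checking those files, I found that most of the contents are incorrect or misspelled. I would like to show a few incorrect entries from one of those files, mya.training_text. For example, [image: screenshot from 2017-10-12 21-01-05] https://user-images.githubusercontent.com/4832700/31503553-f608e970-af95-11e7-910a-af7ceeb2852d.png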

first arrow head line

It should be "ရုတ်ရုတ်သဲသဲ".

Second arrow head line

It should be "သစ်တောများကုန်".

third arrow head line

should be "ပညာရေးစနစ်". so on.

So I would like to contribute corrections for these 7 files. I would also like to ask the following questions:
- Is the exported mya.traineddata based on those files?
- How can I know which file is used for what? E.g., what is the mya.punc file?
- Where did you get the data?
- Is there any format or rule for putting data into those files?

Could you please explain me about those files? I am also willing to improve Myanmar language in OCR.

Thank you for your contribution.

Shreeshrii commented 6 years ago

Please also try tesseract 4.0alpha which might have improved results.


kyawswa commented 6 years ago

Yes, I used the UB Mannheim build with tesseract 4.0.0-alpha.20170804. I tested with the following image files. The test results are below.

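ocr_sample_1.png [image: ocr_sample_1] https://user-images.githubusercontent.com/4832700/31528553-fff13d48-aff9-11e7-9fca-987a0e68c90c.png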

Result

%%%%% ©05080×05 5082:40:82! 0=2405005$2³050

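ocr_sample_2.png [image: ocr_sample_2] https://user-images.githubusercontent.com/4832700/31528592-396454f2-affa-11e7-9139-f6954eba8ef4.png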

Result

ပဵနႚတ္ဂဵကာဧ်တ္အီးလာသီူး

Thanks.

Shreeshrii commented 6 years ago

langdata repo has not been updated for 4.0x.

You can extract the wordlist from the tessdata_best traineddata file. Use the commands (please lookup the syntax)

combine_tessdata -u ....

dawg2wordlist ...

to see the version of files used for 4.0

You can compare this wordlist to the wordlist in langdata for spelling etc.
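Once both word lists exist as plain UTF-8 text files (one word per line), the comparison itself can be sketched in a few lines. The file paths in the usage comment are placeholders, not real file names, and the helper names are made up for this example.

```python
def load_wordlist(path):
    """Read a one-word-per-line UTF-8 file into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def compare_wordlists(first, second):
    """Summarize which entries are unique to each list and how many are shared."""
    return {
        "only_in_first": sorted(first - second),
        "only_in_second": sorted(second - first),
        "common": len(first & second),
    }


# Usage with placeholder paths (assumed, not actual file names):
#   report = compare_wordlists(load_wordlist("langdata_wordlist.txt"),
#                              load_wordlist("extracted_wordlist.txt"))
print(compare_wordlists({"a", "b"}, {"b", "c"}))
```

The entries that appear in only one list are the natural starting point for a native speaker's spelling review.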


kyawswa commented 6 years ago

Yes, it works perfectly with tessdata_best. But after checking the wordlist, I found many misspellings and incorrect entries. I have pointed out the most obvious misspellings. Please see the following attachment.

[image: ss_21]

Most of the data from the following link is not included in this tessdata_best wordlist. https://github.com/kanaung/wordlists

I would like to know where you got that data.

How can I contribute to update those incorrect data? Thanks.

pndaza commented 6 years ago

The Myanmar traineddata for 4.0 does not recognize the following characters: ၊(u104a) ။(u104b) ၌(u104c) ၍(u104d) ၎(u104e) ၏(u104f)

I unpacked the traineddata and checked the unicharset. I did not find these characters in it.

Also, unicharset_extractor does not produce these characters.
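A check like this can be scripted. The sketch below assumes the unicharset text format seen elsewhere in this thread: a first line with the entry count, then one entry per line starting with the unichar followed by a space. The function names are hypothetical. Entries such as `NULL` only contribute ASCII letters, so they do not affect a check for Myanmar characters.

```python
def unicharset_entries(text):
    """Return the unichar (first field) of every entry line after the count line."""
    lines = text.splitlines()
    return [line.split(" ", 1)[0] for line in lines[1:] if line.strip()]


def missing_codepoints(unicharset_text, wanted):
    """List the wanted characters that occur in no unicharset entry."""
    present = set("".join(unicharset_entries(unicharset_text)))
    return [f"U+{ord(c):04X}" for c in wanted if c not in present]


# Tiny made-up unicharset with shortened property columns, for illustration only.
sample_unicharset = (
    "3\n"
    "NULL 0 Common 0\n"
    "\u1000 1 Myanmar 1\n"
    "\u104a 10 Myanmar 2\n"
)
punctuation = "\u104a\u104b\u104c\u104d\u104e\u104f"
print(missing_codepoints(sample_unicharset, punctuation))
```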

kpfoley commented 5 years ago

I'm just reading through this thread and have a few pointers, some maybe a little repetitive - hope this helps!

1) Unicode range: just about everything in contemporary use is in the 1000-104F range. The second half of the base range is for special characters needed for writing a number of other languages spoken in Myanmar but also for some Pali words that might be needed, for example, for religious and historical texts. I'm not sure whether users in Myanmar would rather see a model that supports the full range or a more compact model trained on the labels in the first half of the range, which are MUCH more frequently used. In any case training on the 1000-104F range will probably cover > 99.9% of the data (just a guess).

2) Word lists - the best word list available to researchers AFAIK is a 133k word list maintained by the Myanmar Language Commission (MLC). This is basically just a list of all the words from the large official dictionary of the language. I don't think the word list is publicly available, and it doesn't have any proper nouns, but it would be a good resource if it could be made available for this project. The kanaung word list linked above by @herzcthu is much better than the mya wordlist and I think it is based on an old software release of a "correct spelling" word list and has since incorporated words and place names from other sources like the postal system. I think there are a good deal of non-words and very rare words in that list, though, which maybe limits its usefulness.

3) Words and spacing - as mentioned above, Burmese / Myanmar doesn't use spaces between every word in its writing system. Spaces are thrown in as needed for typography and usually between multi-word phrases. So the best options for a language model are probably either a character-based language model or a model that incorporates word segmentation (prediction of spaces) into the pipeline. I'm not sure if either of these are possible with Tesseract 4.0. As somebody already mentioned above, the lack of spacing is probably why there are >500,000 words in the mya.wordlist file. As of right now I think the state of the art in Burmese word segmentation (breaking non-spaced continuous text into words) is around 99 percent accuracy -- it's not perfect but it's pretty good.

4) Syllable segmentation, such as Ye Kyaw Thu's script here https://github.com/ye-kyaw-thu/myPOS/tree/master/corpus-draft-ver-1.0, might also be simpler and more effective than a character-level language model in the absence of word segmentation. A few lines of regex can capture the boundaries between syllables, in which case the language model approach can be similar to what you might use for Chinese (sequences of syllables instead of sequences of words).

5) Unicode and Zawgyi - the choice between Unicode and Zawgyi is controversial in Myanmar, and Zawgyi is much more popular, with Unicode maybe making some inroads. The problem with Zawgyi, in addition to being non-standard, is that it takes up random spaces in the shared Myanmar unicode range, including spaces for other languages spoken in Myanmar, so it breaks not just Burmese but also the entire extended Myanmar unicode range. It's also frustrating because it used to be difficult to cleanly convert from Zawgyi to Unicode and back, and it still doesn't work perfectly because Zawgyi hides many typing errors by superimposing the same typed letter without advancing the cursor. For this reason there is a lot of corrupted Burmese language text data on the web that may have been converted to unicode at some point (or could be converted) but it's difficult to catch all of the errors left over from the original typed Zawgyi input. Wikipedia has a good explainer here on the two encoding standards: https://my.wikipedia.org/wiki/Wikipedia:Font#Why_not_Zawgyi? The main thing to watch out for here is not to accidentally feed Zawgyi text into the training data, because with so many overlapping codepoints it could wreck the accuracy of the model.
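The "few lines of regex" idea from point 4 can be illustrated as follows. This is a simplified sketch and not a copy of the script linked above: it starts a new syllable at every Burmese consonant (U+1000 to U+1021) unless that consonant is stacked (preceded by U+1039) or closes the previous syllable (followed by asat U+103A or by U+1039). Digits, punctuation, independent vowels and other languages of the Myanmar block are not handled. The name `syllables` is hypothetical.

```python
import re

# A consonant that is neither stacked under the previous one nor a syllable-final.
SYLLABLE_START = re.compile("(?<!\u1039)[\u1000-\u1021](?![\u103a\u1039])")


def syllables(text):
    """Split text into syllable-like chunks at each qualifying consonant."""
    starts = [m.start() for m in SYLLABLE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any leading material as its own chunk
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a != b]


word = "\u1019\u103c\u1014\u103a\u1019\u102c\u1005\u102c"  # "Myanmar writing"
print(syllables(word))
```

A language model could then be built over these chunks instead of over space-delimited "words", along the lines suggested in point 4.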

Shreeshrii commented 5 years ago

Thank you for the detailed notes. Please review the source training data in langdata_lstm repo also.

Shreeshrii commented 5 years ago

Please test the traineddata at https://github.com/Shreeshrii/tessdata_shreetest/blob/master/mya430000.traineddata

and let me know whether it is an improvement over the existing traineddata files.

herzcthu commented 5 years ago

Hi Shreeshrii, I've tested your traineddata. It is a little improved over existing traineddata in tesseract 4.0 beta. Especially it can detect punctuation and non-burmese characters better.

BTW, I'm trying to train myself, I'm generating lots of box and tif files for only one font. Is that a good idea to have many files for single font? Or should I make it only one box and one tif file. Currently I have more than 1000 files.

thanks and regards, Sithu

Shreeshrii commented 5 years ago

The amount of training data you need depends on the type of training that you are planning to do. eg. from scratch, replace a layer, plus minus, etc.

I think multiple files for single font may be ok. How are you generating these files?

You should try to keep approximately the same number of lines in each file so that all samples are used in a uniform way for training.
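Keeping the line counts per file roughly equal can be done before rendering. The sketch below splits a list of text lines into chunks whose sizes differ by at most one; the function name `split_evenly` and the limit of 4 lines in the example are arbitrary choices, not values recommended anywhere in this thread.

```python
import math


def split_evenly(lines, max_lines_per_file):
    """Split lines into the fewest chunks of at most max_lines_per_file, sized evenly."""
    if max_lines_per_file < 1:
        raise ValueError("max_lines_per_file must be at least 1")
    n_files = max(1, math.ceil(len(lines) / max_lines_per_file))
    base, extra = divmod(len(lines), n_files)
    chunks, start = [], 0
    for k in range(n_files):
        size = base + (1 if k < extra else 0)  # the first `extra` chunks get one more line
        chunks.append(lines[start:start + size])
        start += size
    return chunks


corpus = [f"line {i}" for i in range(10)]
print([len(chunk) for chunk in split_evenly(corpus, 4)])
```

Each chunk can then be written to its own text file and rendered separately.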

Shreeshrii commented 5 years ago

I had used the fonts 'Myanmar Khyay', 'Myanmar Sans Pro', 'Myanmar Text' and 'Noto Sans Myanmar'.

Which one is a more representative font out of these for training and testing?

herzcthu commented 5 years ago

I took one paragraph from Wikipedia and made screenshots with all the fonts you mentioned. Noto Sans Myanmar gives the best result. There is a new font that will be used in official documents. It is called Pyidaungsu. You can download it here: https://www.unicode.today/fonts-download/

I'm creating box and tif files using the text2image binary from training. I collected 1 million Unicode text lines from Wikipedia and 3 well-known news websites, and I am creating box files from that content.

Shreeshrii commented 5 years ago

@herzcthu Thanks for the info about the new font.

If you do 'replace layer' type of training, you can get by with fewer lines.

Keep posting about your progress with training.

herzcthu commented 5 years ago

I'm stuck at unicharset extractor. I get one unicharset file. But when I open that file, I'm seeing some unusual combination of characters which is not possible to exist in Burmese scripts. I wonder if this kind of junk can affect training. I've attached unicharset file I got. output_unicharset.txt

Here are some samples that are unusual:

တ္မြ 1 0,255,0,255,0,0,0,0,0,0 Myanmar 514 0 514 တ္မြ   # တ္မြ [1010 1039 1019 103c ]x
တ္က်ေ 1 0,255,0,255,0,0,0,0,0,0 Myanmar 515 0 515 တ္က်ေ # တ္က်ေ [1010 1039 1000 103a 1031 ]x
ဥ္တြ 1 0,255,0,255,0,0,0,0,0,0 Myanmar 516 0 516 ဥ္တြ   # ဥ္တြ [1025 1039 1010 103c ]x
င္လေ 1 0,255,0,255,0,0,0,0,0,0 Myanmar 517 0 517 င္လေ   # င္လေ [1004 1039 101c 1031 ]x
ည္ဖြ 1 0,255,0,255,0,0,0,0,0,0 Myanmar 518 0 518 ည္ဖြ   # ည္ဖြ [100a 1039 1016 103c ]x

Shreeshrii commented 5 years ago

Use NORM_MODE="3" with unicharset extractor command.