Open gindrawan opened 4 years ago
Hi @gindrawan, with jTessBoxEditor you will get a recognition model which uses the old legacy recognizer, but not the LSTM one.
For training LSTM, you need a large amount of ground truth data, that is, pairs of line images and text files with the corresponding text. You can use images generated by rendering the text with a Balinese font, and you can also use scans from Balinese publications (books, newspapers, ...), where you have to extract the lines and transcribe the text. Ideally both kinds of images are available.
Are there any converters from Bali Simbar Dwijendra to Unicode?
As far as I know, there is no such converter. I found the Vimala font, whose glyph shapes are quite close to those of the Bali Simbar Dwijendra font, as I mentioned at https://github.com/tesseract-ocr/langdata/issues/126.
Hi @Shreeshrii,
Based on your changes to the tesseract code base in https://github.com/tesseract-ocr/tesseract/commit/b34cf9d424e88cd09aaa193697127c90ff76e0ce#diff-eaafd22a79065f5b8d28318d482e650d
if I insert, say, "Noto Sans Balinese" or "Vimala" at line 608 (after ""Prada" \"), what else do I critically need to add or change to get the training working? (Of course, I am also looking at the other 2 files that you changed.)
https://github.com/tesseract-ocr/tesseract/commit/0eb7be1cd1707931abd77903793bf966a6640d58#diff-eaafd22a79065f5b8d28318d482e650d https://github.com/tesseract-ocr/tesseract/commit/7957288fd5502551b6c7f073c5f4ecd1f0b11dd8#diff-eaafd22a79065f5b8d28318d482e650d
I'm still in "trial and error" mode with the training process, based on your https://github.com/Shreeshrii/tessdata_jav_java. Before I go further with that option, some light from you would be useful. Thanks...
jav_java was done more than a year ago.
Now there is also the possibility of training from line images. See the tesseract-ocr/tesstrain repo.
Please wait for a day or two. I am in the process of setting up something for Balinese that you can then extend with your training text.
It is possible that no changes will be required in the tesseract codebase.
It will also be useful if you can create ground truth transcriptions in Unicode for at least 5 scanned page images from books, which can be used for validating the training.
You can also create a few hundred line images with transcriptions for fine-tuning the traineddata created with synthetic images.
Thank you @Shreeshrii. Here are scanned page images from books (from a quick Internet search) with various image types and sizes. I am still preparing the synthetic images (in Noto Sans/Serif Balinese and Vimala) and hope to post them today or tomorrow.
One more thing: if the trained data is generated successfully, will it be compatible with Tesseract4Android (https://github.com/adaptech-cz/Tesseract4Android)? They require a trained data file from https://github.com/tesseract-ocr/tessdata/tree/4.0.0
Just images are not enough. What is needed is the correct (ground truth) text in Unicode for each of those images.
So, the files should be named like 001.png and 001.gt.txt: the same basename, with the .gt.txt extension for the Unicode text of each image.
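The naming rule above can be checked mechanically before training. The following is a minimal Python sketch, not part of tesstrain: the function name `find_unpaired` and the set of accepted image extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the scans actually use.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff", ".jpg", ".jpeg"}
GT_SUFFIX = ".gt.txt"


def find_unpaired(directory):
    """Return (images_without_text, texts_without_image) as sorted basenames."""
    images = set()
    texts = set()
    for path in Path(directory).iterdir():
        if not path.is_file():
            continue
        name = path.name
        if name.endswith(GT_SUFFIX):
            # "001.gt.txt" -> basename "001"
            texts.add(name[: -len(GT_SUFFIX)])
        elif path.suffix.lower() in IMAGE_SUFFIXES:
            # "001.png" -> basename "001"
            images.add(path.stem)
    return sorted(images - texts), sorted(texts - images)
```

Calling `find_unpaired("some/ground-truth-dir")` on a directory returns two lists: images that still need a transcription, and transcriptions whose image is missing.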
For a work in progress, see https://github.com/Shreeshrii/tesstrain-bali/tree/master/test
I need the correct text for the images so that it can be compared with the OCRed text to verify accuracy on actual images.
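One common way to compare OCRed text against ground truth is the character error rate (edit distance divided by ground truth length). The sketch below is an illustration under that assumption; the thread does not say which metric is used, and applying NFC normalization before comparing is also an assumption.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_output):
    """Edit distance divided by the ground truth length, after NFC normalization."""
    # Assumption: both texts are normalized so that equivalent code point
    # sequences are not counted as errors.
    gt = unicodedata.normalize("NFC", ground_truth.strip())
    ocr = unicodedata.normalize("NFC", ocr_output.strip())
    if not gt:
        return 0.0 if not ocr else 1.0
    return levenshtein(gt, ocr) / len(gt)
```

A value of 0.0 means the OCR output matches the transcription exactly; the value can exceed 1.0 when the output is much longer than the ground truth.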
balinese-script-images-v1.zip https://github.com/tesseract-ocr/langdata/files/4374016/balinese-script-images-v1.zip
Sorry, I forgot about the text files. That may need more time. OK then, I am still preparing the synthetic images; I think it is faster to get those ready. One question: how many words are needed per line?
This is a small pair of image and text files using Noto Serif Balinese; I took them from https://en.wikipedia.org/wiki/Balinese_script. I hope they can be used for now. small-pair-image-text.zip
Oh, I forgot: do the images need box files, or only the Unicode text?
Just the unicode text.
Hi @Shreeshrii,
It seems I need more time to prepare the training data (1-2 more days).
Meanwhile, I just realized that training data can come as page images (https://github.com/topherseance/javanese-aksara-training-text) or as line images.
Based on your previous answer, it seems you prefer line images? What is the situation with page images?
Preparing line images takes more effort in my case, because each page image has to be converted into several line images. But if the training result is noticeably better, that's OK.
In the attachment I have a sample of my page image with its ground truth text. Is it OK before I proceed to line images?
Are you preparing synthetic data using fonts or using actual images similar to what needs to be recognised later?
https://github.com/Shreeshrii/tesstrain-bali/tree/master/langdata
I had done a training run with 4-5 fonts.
I am preparing synthetic data using fonts: about 5 thousand words (the remaining roughly 29 thousand words are still having their Unicode verified) using Noto Serif Balinese. I just downloaded the latest font, updated 3 days ago (https://github.com/googlefonts/noto-fonts/tree/master/phaseIII_only/unhinted/ttf/NotoSerifBalinese). It is somewhat more up to date than Noto Sans Balinese.
Those 5 thousand words have already been turned into 101 page images, each containing 12 lines of training text, with about 5-10 words per line. I need a little more time to finalize it. Going to line images would need extra time.
After that I will move on to Vimala, with the same Unicode text as for Noto Serif Balinese. Vimala is more likely to be needed for recognizing actual images.
The font most needed for recognizing actual images, Bali Simbar Dwijendra (BSD), we plan to do later: since it does not use the Balinese Unicode code points, preparing its training data takes more time and effort. In fact, if BSD is included, the Balinese script recognition app would have 2 options for post-processing: Unicode and non-Unicode (I imagine a radio button to select before recognition).
Generation of synthetic data is not an issue. It is actually quite easy to generate page images or line images given a training text and set of fonts.
See https://github.com/Shreeshrii/tesstrain-bali/tree/master/gt/bali-Vimala which has line images and their ground truth generated from random Sanskrit text (https://github.com/Shreeshrii/tesstrain-bali/blob/master/langdata/bali.training_text) converted to Balinese script. This is not showing up correctly in my web browser, but it is OK when I apply the Vimala font in Notepad++.
LSTM training works on line images, so it is better to do line images. But this can be done easily by a computer.
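As an illustration of the text half of that step, the sketch below splits a page-level transcription into one ground truth file per line. It is only a sketch under assumptions: the function name and the `<basename>_<NNN>.gt.txt` naming are hypothetical, and cutting the page image into matching line images is not shown.

```python
from pathlib import Path


def write_line_ground_truth(page_text, basename, out_dir):
    """Write one .gt.txt file per non-empty line of a page transcription.

    Returns the list of written paths. The "<basename>_<NNN>.gt.txt" naming
    is a hypothetical convention; each file still needs a line image with
    the same basename.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    lines = [line.strip() for line in page_text.splitlines()]
    written = []
    for index, line in enumerate((l for l in lines if l), start=1):
        path = out_dir / f"{basename}_{index:03d}.gt.txt"
        path.write_text(line + "\n", encoding="utf-8")
        written.append(path)
    return written
```

Blank lines in the page transcription are skipped, so the numbering follows the visible text lines on the page.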
It seems to me that you are just taking a word list and generating text lines and images from that. Instead you should actually be using sentences and paragraphs and phrases along with punctuation similar to the pages that need to be recognized.
Regarding "The most needed for actual images recognition, Bali Simbar Dwijendra (BSD) we plan later since using non-balinese unicode": if there were any script that maps from BSD to Unicode, it could probably be handled programmatically. Otherwise you should take a page in BSD and transcribe it in Unicode.
When I asked for page images for testing, I meant some sample actual images (in BSD).
I am generating images in five fonts: Kadiri, Noto Sans Balinese, Noto Serif Balinese, Pustaka Bali, Vimala
However, if only Vimala is required, it will probably be faster to get convergence.
I think it's OK to include all of those fonts. Kadiri, Pustaka, and Vimala each seem to mimic a different style of ancient glyphs. Moreover, Vimala was also developed with the BSD style as a reference. Noto Sans Balinese and Noto Serif Balinese do not seem to differ much from each other. I don't know why Google released both of them.
@Shreeshrii, I just made a mapping from BSD to Balinese Unicode; perhaps it is useful. bsdcode.2.balineseunicode.txt
Is http://www.unicode.org/udhr/d/udhr_ban.html in BSD?
I did a simple substitution using sed to convert the text from there to Unicode using the mapping you suggested. I don't think it is correct. B is not converted, also some signs don't seem right. I don't know the language to verify.
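For comparison with a chain of sed substitutions, here is a minimal Python sketch of a single-pass, longest-key-first substitution, plus a helper that lists characters no rule covers (such as the unconverted B mentioned above). The keys in `EXAMPLE_MAPPING` are made up for illustration and are not the real BSD mapping; only the Unicode code points on the right are actual Balinese characters. Whether rule ordering is the cause of the wrong signs is an assumption, not something the thread confirms.

```python
import re

# HYPOTHETICAL entries for illustration only: the left-hand BSD keystrokes are
# made up and must be replaced by the entries of the real mapping file.
EXAMPLE_MAPPING = {
    "k": "\u1b13",  # U+1B13 BALINESE LETTER KA
    "/": "\u1b44",  # U+1B44 BALINESE ADEG ADEG
}


def _compile(mapping):
    if not mapping:
        raise ValueError("mapping must not be empty")
    # Longest keys first, so multi-character keys win over their prefixes.
    keys = sorted(mapping, key=len, reverse=True)
    return re.compile("|".join(re.escape(key) for key in keys))


def build_converter(mapping):
    """Return a function that applies all substitutions in a single pass."""
    pattern = _compile(mapping)

    def convert(text):
        # Single pass: already converted output is never substituted again.
        return pattern.sub(lambda match: mapping[match.group(0)], text)

    return convert


def unconverted_characters(text, mapping):
    """Return the sorted non-whitespace characters that no mapping key covers."""
    leftover = _compile(mapping).sub("", text)
    return sorted({ch for ch in leftover if not ch.isspace()})
```

Running `unconverted_characters` over a whole BSD text gives a quick list of keys that still need an entry in the mapping.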
http://www.unicode.org/udhr/d/udhr_ban.html is in Balinese Latin, not BSD (as with Javanese, where the Latin form uses the convention name "java" and the Javanese script uses "java-jav"). From there we can convert it to many Balinese script fonts (BSD, Vimala, Noto Serif Balinese, etc.), but it needs some rule-based text preprocessing first. For example: the first word "Sami" on the second line must be converted for
For the reverse process (Balinese script to Balinese Latin), I actually don't know how to make this work in Tesseract, as I illustrated in the attachment.
Oh, for Balinese script to Balinese Latin, "the input" in the illustration file means "the image input".
This is my LibreOffice screenshot. You must install the Bali Simbar Dwijendra font on your Linux OS.
The way tesseract (lstm version) works, the image will be recognised as Unicode text which will render correctly with Unicode Balinese fonts. So, both Vimala and Noto fonts should be able to render the same output.
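Since the recognized output is Unicode, ground truth text should also consist of Balinese Unicode code points rather than the Latin code points a BSD-encoded file contains. The sketch below flags characters outside the Unicode Balinese block (U+1B00 to U+1B7F). The function name is an assumption, and legitimate non-Balinese characters (Latin punctuation, digits, joiners) will be flagged too, so the result is a list to review, not a list of errors.

```python
# Unicode Balinese block: U+1B00 through U+1B7F.
BALINESE_BLOCK = range(0x1B00, 0x1B80)


def non_balinese_characters(text):
    """Return the sorted non-whitespace characters outside the Balinese block."""
    return sorted({
        ch for ch in text
        if not ch.isspace() and ord(ch) not in BALINESE_BLOCK
    })
```

An empty result means every visible character of the text lies in the Balinese block.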
Hi @Shreeshrii, I just improved the BSD code to Unicode mapping at https://github.com/gindrawan/balinese-bsdcode-2-unicode based on your sed file (in the attachment, I marked entries with the status OK, REV, or ADDED. Not all of the added mappings are in there; see the link).
I have tested bali1.traineddata from https://github.com/Shreeshrii/tesstrain-bali/tree/master/data using a simple BSD word image, but the result is still not right (the file is in the attachment, with ground truth text files in both BSD code and Unicode for checking). Perhaps that is because it has not yet been trained on BSD.
Regarding udhr.latn.txt, if you want to transliterate it to BSD-style Balinese script, you can try this Android app (still not perfect, though): https://play.google.com/store/apps/details?id=id.ac.undiksha.aksarabalisd&hl=en
You have given a link to apps that convert from Latn to BSD as well as to Noto (Unicode) for Balinese.
What would be helpful, if you want to train for BSD, is if you could send me two text files, one in BSD and one in Noto, for the same Balinese text, similar to the file you sent earlier, which was just one word.
bsd2unicode.sed.txt https://github.com/tesseract-ocr/langdata/files/4400924/bsd2unicode.sed.txt bakta.zip https://github.com/tesseract-ocr/langdata/files/4400927/bakta.zip
I just made them, but the set is still small since generating them is quite manual. https://github.com/gindrawan/balinese-script-training I am thinking about how to speed it up...
How will you train tesseract with such data? I guess you will feed it the images generated from the BSD gt text files and map them to the NSB gt text files.
Hi, I want to develop an OCR for Balinese script (https://en.wikipedia.org/wiki/Balinese_script) using Tesseract 4.0 and the tool jTessBoxEditor 2.2.1 (which still does not support LSTM?).
There are two fonts involved (in the attachment)
I want to accommodate both fonts, with priority given to Bali Simbar Dwijendra. Sorry, I am new to Tesseract; my question is, how do I start?
Thank you very much for your kind attention.
Best regards, Indra
bali-simbar-dj-noto-serif-balinese.zip