tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Would like to help for Burmese/Myanmar language training? #13

Open herzcthu opened 9 years ago

herzcthu commented 9 years ago

Hello, I would like to help. I've already cloned all the repositories. How do I start?

herzcthu commented 5 years ago

Tried norm_mode 2 and 3. Both have missing vowels and medials.

n-92 commented 4 years ago

Hi,

Any further progress?

GmGniap commented 3 years ago

Please mention me or let me know if you need help checking or fixing Burmese datasets; I'd gladly be part of it. I have some experience in Python and TypeScript. Cheers! All for helping to improve the Myanmar language in machines.

glxwine commented 1 month ago

I am new to Tesseract. Recently I tried the Myanmar language, and the results are not good enough yet. I searched for how the training data set is built and found this thread. However, it seems to be very old, with no recent updates. I am not familiar with how to train the data sets, but I know the language. Is there any way we can improve the Myanmar language support? I would also like to understand how the training is done.

stweil commented 1 month ago

The training requires training data: lots of line images (*.png) with corresponding transcriptions (*.gt.txt). The original training used generated (artificial) line images, but newer trainings for other scripts are now often based on real line images from scanned books or newspapers. It's also possible to use a mix of artificial and real line images. You need as many lines as possible, and the text must cover all relevant glyphs (characters).
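To illustrate the two requirements above (every line image has a transcription, and the text covers all relevant glyphs), here is a minimal Python sketch. The function names are hypothetical and not part of Tesseract or tesstrain, and it assumes all `*.png` and `*.gt.txt` files sit in a single directory and share a base name. Which characters count as "required" is left to the caller.

```python
from collections import Counter
from pathlib import Path


def find_pairs(data_dir):
    """Return (paired, images_without_text, texts_without_image) as sorted name lists."""
    data_dir = Path(data_dir)
    # Strip the suffixes by hand because ".gt.txt" is a double extension.
    images = {p.name[:-len(".png")] for p in data_dir.glob("*.png")}
    texts = {p.name[:-len(".gt.txt")] for p in data_dir.glob("*.gt.txt")}
    return sorted(images & texts), sorted(images - texts), sorted(texts - images)


def glyph_counts(data_dir, stems):
    """Count every non-whitespace character in the transcriptions of the given lines."""
    counts = Counter()
    for stem in stems:
        text = (Path(data_dir) / f"{stem}.gt.txt").read_text(encoding="utf-8")
        counts.update(ch for ch in text if not ch.isspace())
    return counts


def missing_glyphs(counts, required):
    """Return the required characters that never occur in the counted text."""
    return sorted(set(required) - set(counts))
```

A possible use is to call `find_pairs` on the ground-truth directory, pass the paired names to `glyph_counts`, and then compare the result against a list of Myanmar consonants, vowel signs and medials that a native speaker considers essential. Characters that are missing or very rare indicate where more lines are needed.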

With enough lines for training, you can use tesstrain (https://github.com/tesseract-ocr/tesstrain/) for the training.
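As a rough sketch of what starting a tesstrain run can look like, the Python helper below only assembles the `make training` command line. The helper names and the example values are hypothetical; `MODEL_NAME`, `GROUND_TRUTH_DIR` and `MAX_ITERATIONS` are Makefile variables as I understand them from the tesstrain README, so check that README for the current list and defaults before relying on them.

```python
import subprocess


def tesstrain_command(model_name, ground_truth_dir=None, max_iterations=None):
    """Build the argument list for tesstrain's `make training` target."""
    cmd = ["make", "training", f"MODEL_NAME={model_name}"]
    if ground_truth_dir is not None:
        # Directory holding the *.png / *.gt.txt pairs.
        cmd.append(f"GROUND_TRUTH_DIR={ground_truth_dir}")
    if max_iterations is not None:
        cmd.append(f"MAX_ITERATIONS={max_iterations}")
    return cmd


def run_training(cmd, tesstrain_dir):
    """Run the command inside a local tesstrain checkout (path supplied by the caller)."""
    return subprocess.run(cmd, cwd=tesstrain_dir, check=True)
```

For example, `tesstrain_command("mya_custom", "data/mya_custom-ground-truth", 10000)` returns the list to pass to `run_training`; the model name and directory here are placeholders only.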

Examples of training data for Latin script: https://code.bib.uni-mannheim.de/ocr-d/GT4HistOCR/src/branch/master/dta19/1827-heine_lieder.

Examples of training steps: https://github.com/UB-Mannheim/tesstrain/wiki/.

Make sure to document your training process and to publish your training data if you want to submit the result for inclusion in the tesseract-ocr repositories.

glxwine commented 4 weeks ago

Thank you for providing the steps related to training. I think I will have to try a lot to understand the steps.

For the moment, I am more inclined to provide the training data (.png and .gt.txt). I know that even this will take a lot of pairs, but I am willing to do more on that if there is anyone who can use them for training. I believe that Myanmar script in images is not very complicated compared with the German script example. Myanmar words generally have the same shape, except that the size might increase or decrease proportionately. If the basic shapes can be identified, the result will improve. That is what I think; apologies if this is too simplistic. What I mean is that Myanmar letter shapes are quite consistent and different styles are rarely used. I am also willing to help with the .png and .gt.txt pairs if I am given more detailed requirements for providing them.

thanks and best regards,
