tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Add support for Armenian #67

Closed Shreeshrii closed 4 years ago

Shreeshrii commented 7 years ago

copied from: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/zn4Xd-8wKe8/B6VpQkuZAwAJ

Dear all,

I have been trying Tesseract recently and it is really a very good product. Is there a tutorial or a set of steps describing how to add support for a new language to the package, for example Armenian?

Thank you in advance.

Regards, Vahe

Shreeshrii commented 7 years ago

Vahe, Please add the following info.

Shreeshrii commented 7 years ago

langdata has https://github.com/tesseract-ocr/langdata/blob/master/Armenian.unicharset

but no folders for Armenian languages.

@theraysmith Is this one of the new languages included in your current training?

I had closed an earlier issue - https://github.com/tesseract-ocr/langdata/issues/51

Shreeshrii commented 7 years ago

http://crubadan.org/languages/hy (zip file has word frequency lists, unigrams, bigrams etc)

http://hy.wikipedia.org/

https://en.wikipedia.org/wiki/Armenian_language

https://en.wikipedia.org/wiki/Eastern_Armenian

https://en.wikipedia.org/wiki/Western_Armenian

https://en.wikipedia.org/wiki/Classical_Armenian_orthography

https://en.wikipedia.org/wiki/Armenian_orthography_reform

amitdo commented 7 years ago

https://en.wikipedia.org/wiki/Armenian_alphabet

vahenr commented 7 years ago

Thanks for all the comments (sorry for the late response):

Language code: arm
Modern Armenian: Eastern_Armenian
Fonts: please refer to http://armunicode.com/en/fonts/unicode/

vahenr commented 7 years ago

For this item: sources for primary texts in Unicode in the Armenian language to use for training.

Do you need any Armenian text pages?

Shreeshrii commented 7 years ago

Scans of text pages with their ground truth transcription will be useful for OCR evaluation. However, 4.0 LSTM does not yet support training from these.

For LSTM training it will be useful to have access to Unicode text and Unicode fonts. Is there an Armenian Wikipedia?

On 12-Apr-2017 11:36 PM, "vahenr" notifications@github.com wrote:

For this one: Sources for primary texts in unicode the Armenian language to use for training

Do you need any Armenian text pages ?


vahenr commented 7 years ago

Yes, there is an Armenian Wikipedia. This is the link: https://hy.wikipedia.org/wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D5%A7%D5%BB

I will try to get some Unicode text resources and share them with you.

Thank you once again.

vahenr commented 7 years ago

text-v2.docx

I attached a text file in Armenian Unicode; I hope it helps. If you need any more, please let me know.

Shreeshrii commented 7 years ago

Thanks, I will give it a try and let you know.

Shreeshrii commented 7 years ago

Attached is a zip file with arm.traineddata for use with --oem 0 (i.e. the legacy engine only), for testing. Please give it a try; I have not done any evaluation on it.

I did training using the following command:


training/tesstrain.sh \
  --fonts_dir /mnt/c/Windows/Fonts \
  --lang arm \
  --exposures "0" \
  --langdata_dir ../langdata \
  --tessdata_dir ../tessdata \
  --output_dir ~/tesstutorial/arm \
  --fontlist "Arial" \
    "Consolas" \
    "Courier New" \
    "DejaVu Sans" \
    "DejaVu Sans Mono" \
    "DejaVu Serif" \
    "FreeMono" \
    "FreeSans" \
    "FreeSerif" \
    "Microsoft Sans Serif" \
    "Segoe UI" \
    "Sylfaen" \
    "Tahoma" \
    "Times New Roman," \
    "Trebuchet MS" \
    "Verdana" \
    "Verdana Bold" \
    "Verdana Bold Italic" \
    "Verdana Italic"

arm.zip
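A minimal sketch of how a legacy-only traineddata file such as the attached arm.traineddata might be tried out. The function name run_legacy_ocr and the paths in the usage comment are hypothetical; -l, --oem and --tessdata-dir are regular tesseract command-line options in 4.0.

```shell
# Sketch: run the legacy engine (--oem 0) against a directory that holds
# arm.traineddata. The function name and arguments are illustrative only.
#   $1 = input image, $2 = output base name (tesseract appends .txt),
#   $3 = directory containing arm.traineddata
run_legacy_ocr() {
  local image=$1 outbase=$2 tessdata_dir=$3
  if [ ! -f "$tessdata_dir/arm.traineddata" ]; then
    echo "arm.traineddata not found in $tessdata_dir" >&2
    return 1
  fi
  # --oem 0 selects the legacy engine, which this traineddata was built for.
  tesseract "$image" "$outbase" --tessdata-dir "$tessdata_dir" -l arm --oem 0
}

# Hypothetical usage, after unpacking arm.zip into ./arm:
#   run_legacy_ocr page.png page_out ./arm
```

The existence check comes first so that a wrong directory produces a clear message rather than a failure inside tesseract.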

Shreeshrii commented 7 years ago

Attached is an eval report using one of the training text images - arm.Sylfaen.exp0.txt

CER: 2.91
WER: 5.02
WER (order independent): 4.63

arm_report.zip

vahenr commented 7 years ago

Thanks a lot for the files. Could you please tell me exactly what to do for the next step, and what we are missing? Thank you very much once again.

vahenr commented 7 years ago

I did some tests. For the first one I got:

Error in pixGenHalftoneMask: pix too small: w = 270, h = 97

But the output overall is not bad (attaching the original and the output); there are some wrong characters.

armeniantext armeniantext.txt

vahenr commented 7 years ago

The next test was better, no errors.

fedrasansarmenian second.txt

vahenr commented 7 years ago

Waiting for your suggestions.

Shreeshrii commented 7 years ago

I can do another legacy training for Armenian using more fonts (bold, italic) and post that for you to test.

I am also trying LSTM training, but that will only be an experiment on my part.

I hope @theraysmith includes it in the next training.

On 15-Apr-2017 12:37 AM, "vahenr" notifications@github.com wrote:

Waiting for your suggestions.


Shreeshrii commented 7 years ago

Please see attached zip file.

arm-2.zip

It has a newer arm.traineddata as well as the training_text, fonts list etc. that I used. You can test to see if this is better than the earlier version. Use --oem 0, since it does not have LSTM traineddata.

You can do training by modifying training text etc. You will need to add arm as a valid language code in

https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L21

and also add a line similar to https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L921 for arm.
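The two edits described above can be pictured with a self-contained imitation of the pattern: a list of valid codes plus a per-language branch. This is only a sketch; the variable and function names (VALID_CODES, is_valid_code, set_lang_params, FONTS_NOTE) and the short code list are hypothetical and are not copied from language-specific.sh.

```shell
# Illustrative stand-in for the two places that need editing:
# (1) the list of valid language codes, (2) a per-language branch.
VALID_CODES="afr ara eng hin"   # hypothetical excerpt, arm not yet listed

is_valid_code() {
  local code=$1 c
  for c in $VALID_CODES; do
    if [ "$c" = "$code" ]; then return 0; fi
  done
  return 1
}

set_lang_params() {
  local lang=$1
  if ! is_valid_code "$lang"; then
    echo "ERROR: $lang is not a valid language code" >&2
    return 1
  fi
  case "$lang" in
    arm) FONTS_NOTE="fonts with Armenian glyphs" ;;  # the added branch for arm
    *)   FONTS_NOTE="default fonts" ;;
  esac
}

# Before step (1) the code is rejected.
set_lang_params arm 2>/dev/null || echo "arm rejected"
# After adding arm to the list it is accepted and the arm branch applies.
VALID_CODES="$VALID_CODES arm"
set_lang_params arm && echo "arm accepted: $FONTS_NOTE"
```

In the real script, the same two steps mean editing the file at the linked lines rather than reassigning a variable at run time.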

vahenr commented 7 years ago

Thank you very much once again. I will try to do the test on Monday and post the result. I tested the new arm-2.zip and got the same output, with no big difference.

Shreeshrii commented 7 years ago

Yes, I did not expect much change. I just trained with more fonts, the same text, and the legacy engine. If you notice any common errors, please note those. You may be able to fix them using a unicharambigs file.

I am trying an LSTM training, not sure whether it will give better results. Will share traineddata when completed.

On 16-Apr-2017 3:57 PM, "vahenr" notifications@github.com wrote:

Thank you very much once again. I will try to do the test on Monday and post the result, I tested this new one arm-2.zip got the same output no big difference.


vahenr commented 7 years ago

Could you please help me with this issue:

training/./tesstrain.sh --fonts_dir /root/ocr/training/Fonts --lang arm --exposures "0" --langdata_dir ../langdata --tessdata_dir ../tessdata --output_dir /root/ocr/training_output --fontlist "Aramian Normal" "Arial AM"

=== Starting training for language 'arm'
ERROR: Error: arm is not a valid language code

Thank you once again.

theraysmith commented 7 years ago

See: tesseract/training/language-specific.sh

The Armenian language code is hye in ISO 639-2/T.

On Tue, Apr 18, 2017 at 11:56 AM, vahenr notifications@github.com wrote:

Could you please help me with this issue: training/./tesstrain.sh --fonts_dir /root/ocr/training/Fonts --lang arm --exposures "0" --langdata_dir ../langdata --tessdata_dir ../tessdata --output_dir /root/ocr/training_output --fontlist "Aramian Normal" "Arial AM"

=== Starting training for language 'arm' ERROR: Error: arm is not a valid language code

Thank you once again.


-- Ray.

Shreeshrii commented 7 years ago

Thanks, Ray.

However, hye is marked as an unusable language code. Also, there is no folder for hye in langdata.

https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L36

Shreeshrii commented 7 years ago

@vahenr Please see earlier comment at https://github.com/tesseract-ocr/langdata/issues/67#issuecomment-294288382

You will need to add arm as a valid language code in

https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L21

and also add a line similar to https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh#L921 for arm.

Or as suggested by Ray, use hye as the language code.

vahenr commented 7 years ago

What do I need to put in this file: arm.training_text? This is for the option --langdata_dir ../langdata

Shreeshrii commented 7 years ago

https://github.com/tesseract-ocr/langdata/files/923560/arm-2.zip

The above zip file has the files that I used. Put them in a folder named arm under langdata. The training text I used has the text from the doc file you sent, Unicode text for the UDHR (Universal Declaration of Human Rights), and some text copied from Wikipedia.

The wordlist is taken from the Crúbadán site (crubadan.org); the link is given in an earlier comment in this thread.

These will be sufficient for legacy training. My trials of LSTM training were not successful. Hopefully Ray will provide new traineddata for Armenian soon.
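A rough sketch of the "put them in a folder named arm under langdata" step, assuming the zip has already been unpacked into a local directory. The function name install_lang_files is hypothetical, and so is arm.wordlist as a file name; arm.training_text is the name used earlier in this thread.

```shell
# Sketch: copy per-language training inputs into <langdata>/<lang>/.
#   $1 = directory with the unpacked zip, $2 = langdata checkout,
#   $3 = language code (defaults to arm)
# Prints the number of files copied.
install_lang_files() {
  local src_dir=$1 langdata_dir=$2 lang=${3:-arm}
  local f copied=0
  mkdir -p "$langdata_dir/$lang" || return 1
  for f in "$src_dir/$lang".*; do
    [ -f "$f" ] || continue
    case "$f" in
      *.traineddata) continue ;;  # a training output, not a training input
    esac
    cp "$f" "$langdata_dir/$lang/" || return 1
    copied=$((copied + 1))
  done
  echo "$copied"
}

# Hypothetical usage:
#   install_lang_files ./arm-2 ../langdata arm
```

Only files named after the language code are copied, so unrelated files in the unpacked directory are left alone.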

Shreeshrii commented 7 years ago

Also download other required files from langdata repo. Read the readme file for requirements or just clone the whole repo.

Shreeshrii commented 7 years ago

See https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh for info on training.

gelinger777 commented 7 years ago

Thank you @Shreeshrii for your help in adding Armenian to Tesseract!!!

amitdo commented 7 years ago

https://github.com/tesseract-ocr/tessdata/tree/master/best

Armenian.traineddata hye.traineddata

Shreeshrii commented 7 years ago

@vahenr @gelinger777 Please test Armenian support with the newly posted best traineddata for use with the LSTM engine.

arm2arm commented 5 years ago

Is there any progress on this ticket?

gurgendav commented 4 years ago

Are there any updates?

amitdo commented 4 years ago

As said before here, it is supported.

You need Tesseract 4.0 or newer, and you need to download hye.traineddata.
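A small sketch of checking that the hye model is visible to an installed Tesseract before running OCR with it. The helper names has_lang and ocr_armenian and the image name in the usage comment are hypothetical; --list-langs and -l are regular tesseract options.

```shell
# Sketch: succeed if the language code in $1 appears as a whole line on
# stdin (intended input: the output of `tesseract --list-langs`).
has_lang() {
  local want=$1 line
  while IFS= read -r line; do
    if [ "$line" = "$want" ]; then return 0; fi
  done
  return 1
}

# Run OCR with the Armenian model, but only if tesseract can see it.
#   $1 = input image, $2 = output base name (tesseract appends .txt)
ocr_armenian() {
  local image=$1 outbase=$2
  # 2>&1 because some versions print the language list on stderr.
  if ! tesseract --list-langs 2>&1 | has_lang hye; then
    echo "hye not available; place hye.traineddata in the tessdata directory" >&2
    return 1
  fi
  tesseract "$image" "$outbase" -l hye
}

# Hypothetical usage:
#   ocr_armenian scan.png scan_out
```

Matching whole lines avoids false positives from codes that merely contain "hye" as a substring.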

amitdo commented 4 years ago

@Shreeshrii, you opened this issue in 2017. I think you can close it now.