tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Add Latin Extended-A script for Polynesian languages #97

Closed HURIMOZ closed 2 months ago

HURIMOZ commented 6 years ago

Hi, we work with Polynesian languages and we need to have the Latin Extended-A script installed. Thanks in advance for your reply, Tamatoa

amitdo commented 6 years ago

Did you try Tesseract 4.0 with 'Latin' or 'lat' traineddata?

https://github.com/tesseract-ocr/tessdata_best https://github.com/tesseract-ocr/tessdata_fast
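For anyone following along, here is a minimal sketch of what that suggestion could look like when scripted from Python. It assumes Tesseract 4.x is on the PATH and that `Latin.traineddata` has already been downloaded from one of the repositories above into a local directory. The `./tessdata` directory, the `page.png` file name, and the helper `build_tesseract_command` are illustrative, not part of Tesseract. The sketch only builds and prints the command; the commented lines show how it might be run.

```python
import shlex
from typing import List, Optional


def build_tesseract_command(image_path: str,
                            lang: str = "Latin",
                            tessdata_dir: Optional[str] = None) -> List[str]:
    """Build a tesseract CLI call that writes recognized text to stdout.

    `lang` is the traineddata name without the extension, e.g. "Latin"
    for the script model or "lat" for the Latin-language model.
    """
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if tessdata_dir is not None:
        # Point Tesseract at the directory holding the .traineddata file.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths, used only for illustration.
    cmd = build_tesseract_command("page.png", lang="Latin",
                                  tessdata_dir="./tessdata")
    print(shlex.join(cmd))
    # To actually run it (requires Tesseract 4.x to be installed):
    #   import subprocess
    #   text = subprocess.run(cmd, capture_output=True, text=True,
    #                         check=True).stdout
```

Swapping `lang="Latin"` for `lang="lat"` would select the other traineddata file mentioned in the comment.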

HURIMOZ commented 6 years ago

Hi, thanks for your reply. I'm running Tesseract 3.03 with Leptonica (installed from packages, not built from source) on Ubuntu 14. Can I install the Latin traineddata with this version?

amitdo commented 6 years ago

See here: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-304305

Shreeshrii commented 6 years ago

@HURIMOZ You can install Tesseract 4.0 alpha on Ubuntu 14 from Alex's PPA. Please see https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-ppa

The traineddata files Amit referred to will work with that version.

HURIMOZ commented 6 years ago

In fact I don't need trained data for latin. I just need the system to recognize the Latin Extended-A script so it can render the macrons (diacritics) over the vowels: ā, ē, ī, ō, ū, Ā, Ē, Ī, Ō, Ū. Currently the system renders these vowels without the macrons, and my images are of very good quality.
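The symptom described here (vowels coming back without their macrons) can be checked mechanically when a ground-truth transcription is available. The following is a rough sketch under that assumption. `lost_macrons` is a hypothetical helper, and it compares the two strings position by position, so it only works when the OCR output has the same length as the expected text.

```python
import unicodedata

COMBINING_MACRON = "\u0304"


def strip_macrons(text: str) -> str:
    """Remove macrons by decomposing, dropping U+0304, and recomposing."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(COMBINING_MACRON, ""))


def lost_macrons(expected: str, ocr_output: str):
    """Return (index, expected_char, ocr_char) wherever OCR dropped a macron.

    Assumes both strings align character by character.
    """
    expected = unicodedata.normalize("NFC", expected)
    ocr_output = unicodedata.normalize("NFC", ocr_output)
    losses = []
    for i, (want, got) in enumerate(zip(expected, ocr_output)):
        if want != got and strip_macrons(want) == got:
            losses.append((i, want, got))
    return losses


if __name__ == "__main__":
    print(lost_macrons("Māori", "Maori"))  # [(1, 'ā', 'a')]
```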

Shreeshrii commented 6 years ago

Please make a list of the additional characters needed, if the whole Latin Extended-A range is not needed.


HURIMOZ commented 6 years ago

I just need these ten characters: ā, ē, ī, ō, ū, Ā, Ē, Ī, Ō, Ū. Thanks
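For reference, the ten characters can be listed with their code points using Python's standard `unicodedata` module. This small sketch also confirms that each one falls in the Latin Extended-A block (U+0100 to U+017F). The helper name `describe` is just for illustration.

```python
import unicodedata

# The ten requested characters, normalized to precomposed (NFC) form.
REQUESTED = unicodedata.normalize("NFC", "āēīōūĀĒĪŌŪ")

# Latin Extended-A block: U+0100 through U+017F inclusive.
LATIN_EXTENDED_A = range(0x0100, 0x0180)


def describe(chars: str):
    """Return (char, code point, Unicode name, in Latin Extended-A?) tuples."""
    return [
        (c, f"U+{ord(c):04X}", unicodedata.name(c), ord(c) in LATIN_EXTENDED_A)
        for c in chars
    ]


if __name__ == "__main__":
    for char, code_point, name, in_block in describe(REQUESTED):
        print(char, code_point, name, in_block)
```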

HURIMOZ commented 6 years ago

Hi, did you do something particular with these characters? Are they now included in a language pack?

Shreeshrii commented 6 years ago

You can try your own training. Otherwise you have to wait for @theraysmith to upload new langdata, traineddata etc.

Shreeshrii commented 6 years ago

@HURIMOZ Please try https://github.com/tesseract-ocr/tessdata_fast/raw/master/ton.traineddata for Tongan.

It has support for ā, ē and Ā, Ē.

@theraysmith Support is still needed for the following characters for Polynesian languages:

ī, ō, ū, Ī, Ō, Ū.

Shreeshrii commented 6 years ago

> In fact I don't need trained data for latin.

Latin.traineddata is for Latin script (not Latin language) and its unicharset has ā, ē, ī, ō, ū, Ā, Ē, Ī, Ō, Ū.

Please try with 4.00 version of tesseract.
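One way to verify a claim like this for any model is to inspect its unicharset. The sketch below assumes the unicharset has already been extracted from the traineddata file to plain text (for example with Tesseract's `combine_tessdata` tool) and that each entry line begins with the character followed by whitespace and its properties. The helper names and the sample text are made up for illustration; the sample mirrors the `ton` situation described earlier in the thread and is not the real file.

```python
import unicodedata

REQUIRED = unicodedata.normalize("NFC", "āēīōūĀĒĪŌŪ")


def unichars_from_unicharset(text: str) -> set:
    """Collect the first token of every entry line in a unicharset dump.

    The first line of the file holds the entry count, so it is skipped.
    """
    entries = set()
    for line in text.splitlines()[1:]:
        if not line.strip():
            continue
        token = line.split(maxsplit=1)[0]
        entries.add(unicodedata.normalize("NFC", token))
    return entries


def missing_characters(unicharset_text: str, required: str = REQUIRED) -> list:
    """Return the required characters that the unicharset does not contain."""
    present = unichars_from_unicharset(unicharset_text)
    return [c for c in required if c not in present]


# Made-up sample with simplified property fields, for illustration only.
SAMPLE_UNICHARSET = "\n".join([
    "5",
    "NULL 0",
    "ā 3",
    "ē 3",
    "Ā 5",
    "Ē 5",
])

if __name__ == "__main__":
    print(missing_characters(SAMPLE_UNICHARSET))
```

An empty result would mean every requested character is present in the model's character set.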

HURIMOZ commented 6 years ago

I'm using Ubuntu 14, so I can't use Tesseract 4.00.


Shreeshrii commented 6 years ago

@HURIMOZ

As mentioned earlier, you can install Tesseract 4.0 on Ubuntu 14 from Alex's PPA. Please see https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-ppa

I do not think Google will make any changes to the Tesseract 3.0x traineddata files. If you plan to keep using legacy Tesseract, you can try training it for your particular requirements.