Closed HURIMOZ closed 7 months ago
Did you try Tesseract 4.0 with 'Latin' or 'lat' traineddata?
https://github.com/tesseract-ocr/tessdata_best https://github.com/tesseract-ocr/tessdata_fast
Hi, thanks for your reply. I'm running Tesseract 3.03 with Leptonica, installed from packages rather than built from source, on Ubuntu 14. Can I install the Latin traineddata with this setup?
@HURIMOZ You can install Tesseract 4.0 alpha for Ubuntu 14 from Alex's PPA - please see https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-ppa
The traineddata files referred to by Amit will work with that version.
In fact I don't need trained data for latin. I just need the system to recognize the Latin Extended-A script so it can render the macrons (diacritics) over the vowels: ā, ē, ī, ō, ū, Ā, Ē, Ī, Ō, Ū. Currently the system renders these vowels without the macrons, and my images are of very good quality.
Please list the additional characters you need, if the whole Latin Extended-A range is not required.
I just need these ten characters: ā, ē, ī, ō, ū, Ā, Ē, Ī, Ō, Ū. Thanks
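For reference, here is a minimal Python sketch that uses only the standard `unicodedata` module to list the code points of these ten characters and check that they fall inside the Latin Extended-A block (U+0100 to U+017F). The helper name `describe` is made up for illustration and is not part of Tesseract.

```python
import unicodedata

# The ten macron vowels from the request, written as escapes so the
# precomposed forms are unambiguous: ā ē ī ō ū Ā Ē Ī Ō Ū
MACRON_VOWELS = "\u0101\u0113\u012b\u014d\u016b\u0100\u0112\u012a\u014c\u016a"


def describe(chars):
    """Return (char, code point, Unicode name, in Latin Extended-A?) per character."""
    rows = []
    for ch in chars:
        cp = ord(ch)
        in_block = 0x0100 <= cp <= 0x017F  # Latin Extended-A range
        rows.append((ch, f"U+{cp:04X}", unicodedata.name(ch), in_block))
    return rows


for row in describe(MACRON_VOWELS):
    print(*row)
```

If text arrives in decomposed form (a base vowel followed by the combining macron U+0304), `unicodedata.normalize("NFC", text)` recomposes it into the single code points listed here.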
Hi, did you do something particular with these characters? Are they now included in a language pack?
You can try your own training. Otherwise you have to wait for @theraysmith to upload new langdata, traineddata etc.
@HURIMOZ Please try https://github.com/tesseract-ocr/tessdata_fast/raw/master/ton.traineddata for TONGA.
It has support for ā, ē and Ā, Ē.
@theraysmith Support is still needed for the following characters for Polynesian languages:
ī, ō, ū, Ī, Ō, Ū.
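A check like the one above can be scripted. The sketch below is only an illustration: it assumes the unicharset has already been extracted from the traineddata file as plain text (for example with `combine_tessdata -u`), and that its layout is a count on the first line followed by one entry per line beginning with the symbol. The function names and the sample text are hypothetical, and the sample is not the real content of `ton.traineddata`.

```python
# ā ē ī ō ū Ā Ē Ī Ō Ū
REQUIRED = "\u0101\u0113\u012b\u014d\u016b\u0100\u0112\u012a\u014c\u016a"


def unicharset_symbols(text):
    """Collect the first token of every entry line.

    Assumes a plain-text unicharset: the first line is the entry count,
    every following line starts with the symbol, then a space and properties.
    """
    lines = text.splitlines()[1:]  # skip the count line
    return {line.split(" ", 1)[0] for line in lines if line.strip()}


def missing_chars(text, required=REQUIRED):
    """Return the required characters that have no entry, in request order."""
    symbols = unicharset_symbols(text)
    return [ch for ch in required if ch not in symbols]


# Made-up, heavily simplified excerpt for demonstration only.
sample = "5\nNULL 0\na 3\n\u0101 3\n\u0113 3\n\u0100 5\n"
print(missing_chars(sample))
```

Running the same function over the extracted unicharset of each candidate traineddata file shows at a glance which of the ten characters a model cannot output.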
> In fact I don't need trained data for latin.
Latin.traineddata is for Latin script (not Latin language) and its unicharset has ā, ē, ī, ō, ū, Ā, Ē, Ī, Ō, Ū.
Please try with 4.00 version of tesseract.
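As a rough sketch of what trying this could look like from a script, the Python below wraps the `tesseract` command line. It assumes Tesseract 4.00 is on the PATH and that `Latin.traineddata` has been copied into the tessdata directory so that `-l Latin` resolves. The function names and the file name `page.png` are placeholders, not anything from this thread.

```python
import shutil
import subprocess


def build_command(image_path, lang="Latin", tessdata_dir=None):
    """Build the argv for: tesseract IMAGE stdout -l LANG [--tessdata-dir DIR]."""
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def ocr(image_path, **kwargs):
    """Run Tesseract and return the recognized text as a UTF-8 string."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_command(image_path, **kwargs),
        capture_output=True,
        encoding="utf-8",  # keep macron vowels intact regardless of locale
        check=True,
    )
    return result.stdout


print(build_command("page.png"))
```

Calling `ocr("page.png")` on a scan and checking the result for ā, ē, ī, ō, ū is a quick way to see whether the script model preserves the macrons.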
I'm using Ubuntu 14, so I can't use Tesseract 4.00.
@HURIMOZ
As mentioned earlier, you can install Tesseract 4.0 for Ubuntu 14 from Alex's PPA - please see https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-ppa
I do not think Google will make any changes to the Tesseract 3.0x traineddata files. If you plan to stay on legacy Tesseract, you can try training for your particular requirements.
Hi, we work with Polynesian languages and we need to have the Latin Extended-A script installed. Thanks in advance for your reply, Tamatoa