michaelethompson / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

suggestion - add support for additional languages written in Devanagari #1331

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Only Hindi is supported as of 3.02, via hin.traineddata.

What is the expected output? What do you see instead?

What version of the product are you using? On what operating system?
git version on Windows 8

Please provide any additional information below.

Please see http://ildc.in/
A number of official Indian languages are written in the Devanagari script.

I noticed additional langdata for Marathi, Nepali and Sanskrit in the new 
langdata repository.

I suggest adding the other languages as well. As per the ILDC page, the languages are:
Bodo
Dogri
Hindi
Kashmiri - Keshur
Konkani
Maithili
Marathi
Nepali
Sanskrit
Santali
Sindhi
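
To see which of these languages already have trained data in a local install, something like the following could be used. It is only a minimal sketch: `hin` comes from this report, while every other code is an ISO 639 code assumed to match the Tesseract file name, and both `DEVANAGARI_LANGS` and `missing_traineddata` are hypothetical names made up for illustration.

```python
from pathlib import Path

# Language name -> assumed traineddata code. "hin" is named in this report;
# the remaining codes are ISO 639 codes and are an assumption about naming.
DEVANAGARI_LANGS = {
    "Bodo": "brx",
    "Dogri": "doi",
    "Hindi": "hin",
    "Kashmiri": "kas",
    "Konkani": "kok",
    "Maithili": "mai",
    "Marathi": "mar",
    "Nepali": "nep",
    "Sanskrit": "san",
    "Santali": "sat",
    "Sindhi": "snd",
}


def missing_traineddata(tessdata_dir):
    """Return {language: code} for languages with no <code>.traineddata file."""
    root = Path(tessdata_dir)
    return {
        name: code
        for name, code in DEVANAGARI_LANGS.items()
        if not (root / f"{code}.traineddata").is_file()
    }
```

Calling `missing_traineddata` with the path of a tessdata directory lists the languages that still need data in that directory.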

Please see https://code.google.com/r/shreeshrii-langdata/source/browse?name=knn
for the start of Konkani langdata.

Original issue reported on code.google.com by shreeshrii on 8 Oct 2014 at 6:16

GoogleCodeExporter commented 9 years ago
We don't have data for any of these languages other than Marathi, Nepali and 
Sanskrit.

I assume that they are all completely different, and just happen to use 
Devanagari as the script, so it would be pointless to try the language model 
data of one for another.

It would be possible to use your Konkani data; we would just need an 
appropriate contribution agreement. The individual form is here: 
https://cla.developers.google.com/about/google-individual
and the corporate form is here: https://developers.google.com/open-source/cla/corporate

Original comment by theraysm...@gmail.com on 4 Nov 2014 at 9:56

GoogleCodeExporter commented 9 years ago
Ray,

An updated version of Konkani language data is available at 
https://code.google.com/r/shreeshrii-langdata/source/browse?name=kok

Original comment by shreeshrii on 15 Nov 2014 at 11:44