Addition of trained data for Serbian Cyrillic to Tesseract's repository

kcobra / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Addition of trained data for Serbian Cyrillic to Tesseract's repository #1373

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

At present, Tesseract does not provide trained data for recognizing the Serbian 
Cyrillic alphabet. 

To remedy this, I am attaching trained data for the Serbian Cyrillic 
alphabet. For those not familiar with Serbian Cyrillic, this page gives more 
information:

http://en.wikipedia.org/wiki/Serbian_Cyrillic_alphabet

The attached file was created using the Bourne shell script "tesstrain.sh", 
available in the Tesseract source code repository (directory "training").

I would also like to propose renaming the so-called "Serbian Latin" data to srp-lat, 
keeping the "srp" label for the Serbian Cyrillic alphabet.

Original issue reported on code.google.com by puramoca...@gmail.com on 4 Nov 2014 at 7:41

Attachments:

srp.traineddata

GoogleCodeExporter commented 9 years ago

https://groups.google.com/forum/#!topic/tesseract-dev/0BEK1gIXiIQ

Original comment by shreeshrii on 11 Nov 2014 at 1:53

GoogleCodeExporter commented 9 years ago

I can confirm that this trained data works well with the Cyrillic alphabet, with one 
exception: quotation marks ("") make tesseract segfault.

I have attached an example picture that causes the segfault with this command:
$ tesseract test.png out.txt -l srp

Original comment by medicmom...@gmail.com on 8 Apr 2015 at 4:35

Attachments:

test.png
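
For anyone trying to reproduce the crash reported above, here is a minimal sketch of a wrapper that runs a command and reports whether it died from a segmentation fault. The function name `report_segfault` is hypothetical and not part of Tesseract. The check assumes a shell such as bash or dash, which reports death by signal N as exit status 128+N, and a platform where SIGSEGV is signal 11 (as on Linux), giving 139.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): run the given command and
# report whether it was terminated by SIGSEGV.
# Assumption: the shell maps "killed by signal N" to exit status 128+N
# (true for bash and dash), and SIGSEGV is signal 11, so the status is 139.
report_segfault() {
    "$@"
    status=$?
    if [ "$status" -eq 139 ]; then
        echo "segfault (exit status $status)"
    else
        echo "no segfault (exit status $status)"
    fi
    # Pass the original exit status through to the caller.
    return "$status"
}

# Possible usage with the command from the comment above:
#   report_segfault tesseract test.png out.txt -l srp
```

Running the same wrapper against an image without quotation marks would help confirm that the crash is tied to the quote characters rather than to the trained data as a whole.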

GoogleCodeExporter commented 9 years ago

Please create your own project repository for your language data (have a look at 
tesseract-georgian[1] for an example) and post the link here.

[1] https://github.com/ddohler/tesseract-georgian

Original comment by zde...@gmail.com on 12 Apr 2015 at 2:46

GoogleCodeExporter commented 9 years ago

I have created a project repository with the Tesseract-OCR files for recognizing 
Serbian Cyrillic script:

https://github.com/strn/tesseract-serbian

Original comment by puramoca...@gmail.com on 14 Apr 2015 at 10:56

GoogleCodeExporter commented 9 years ago

Thanks. I added it to the wiki: 
https://code.google.com/p/tesseract-ocr/wiki/AddOns#Community_training_projects

Original comment by zde...@gmail.com on 16 Apr 2015 at 7:24

Changed state: Fixed