tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Add Javanese Script for jav-java #126

Open Shreeshrii opened 6 years ago

Shreeshrii commented 6 years ago

Originally posted in the tesseract-ocr forum:

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/8r8YOQgTBT4/xHpCTp9DAwAJ

From: Christopher Imantaka Halim

> Hi,
> 
> I want to develop an OCR for Javanese Script / Aksara.
> https://en.wikipedia.org/wiki/Javanese_script
> 
> I plan on using Tesseract version 4.0.
> I've read the wiki but I'm still confused.
> 
> What do I need to prepare, to start the bare minimum training process? (for Tesseract 4.0)
> In some other thread someone said that training using image files is not supported yet.
> I also found out that box file/tiff pairs are not supported either.
> (I did try making one box file, using this online tool: https://pp19dd.com/tesseract-ocr-chopper/)
> 
> Do we have an example of the training "inputs" somewhere in the GitHub projects?
> 
> Sorry if this is a stupid question, I'm a newbie. :)
> 
> Thanks in advance

bennylin commented 3 years ago

@shreeshrii & @topherseance: there are more than 20 Javanese script fonts available here: https://bennylin.github.io/keyboards/jawa-fonts.html

Shreeshrii commented 3 years ago

@bennylin Are these Unicode fonts?

bennylin commented 3 years ago

Yes

Shreeshrii commented 3 years ago

Are there any labelled datasets with scanned images and their Unicode ground-truth transcriptions that can be used for training/testing Tesseract's jav-java traineddata?

What accuracy did the UKDW OCR achieve?
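For anyone assembling such a labelled dataset, here is a minimal sketch of a sanity check on the transcription side. It tests whether a ground-truth string is actually encoded in the Unicode Javanese block (U+A980 to U+A9DF) rather than in a Latin transliteration or a legacy font encoding. The function names and the 0.9 threshold are hypothetical and chosen for illustration only; they are not part of Tesseract or langdata.

```python
# Unicode block "Javanese": U+A980..U+A9DF (the end of range() is exclusive).
JAVANESE_BLOCK = range(0xA980, 0xA9E0)


def javanese_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Javanese block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_block = sum(ord(c) in JAVANESE_BLOCK for c in chars)
    return in_block / len(chars)


def looks_like_javanese_ground_truth(text: str, threshold: float = 0.9) -> bool:
    """Hypothetical filter: accept a transcription if most of it is Javanese script.

    The threshold is an assumption; real data may legitimately contain digits,
    Latin punctuation, or zero-width joiners, which this check counts as
    non-Javanese.
    """
    return javanese_ratio(text) >= threshold


if __name__ == "__main__":
    sample = "\uA984\uA98F"  # JAVANESE LETTER A, JAVANESE LETTER KA
    print(javanese_ratio(sample))
    print(looks_like_javanese_ground_truth("aksara"))
```

A check like this only validates the encoding of the transcription. It says nothing about whether the text matches the scanned image.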

bennylin commented 3 years ago

I'm not in the loop for the research. You might want to contact Dr. Lucia Krisnawati for that.