tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Add Javanese Script for jav-java #126

Open · Shreeshrii opened this issue 6 years ago

Shreeshrii commented 6 years ago

Originally posted in forum

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/8r8YOQgTBT4/xHpCTp9DAwAJ

From: Christopher Imantaka Halim

> Hi,
> 
> I want to develop an OCR for Javanese Script / Aksara.
> https://en.wikipedia.org/wiki/Javanese_script
> 
> Plan on using Tesseract version 4.0
> I've read the wiki but somehow got confused.
> 
> What do I need to prepare, to start the bare minimum training process? (for Tesseract 4.0)
> In some other thread someone said that training using image files is not supported yet.
> I also found out that box file/tiff pairs are not supported either.
> (I did try making one box file, using this online tool: https://pp19dd.com/tesseract-ocr-chopper/)
> 
> Do we have an example of the training "inputs" somewhere on the github projects?
> 
> Sorry if this is a stupid question, I'm a newbie. :)
> 
> Thanks before

Shreeshrii commented 6 years ago
  1. Collect training text in Javanese script (Unicode). You will need a large number of lines, 500,000 or so, to train from scratch. Alternatively, if you can identify a language/script currently supported by Tesseract that is similar, you can train by replacing a layer. Try to get representative training text of about 50,000 lines with 50 words each.

  2. Collect Unicode fonts which can correctly render the above text. The more fonts you have, the better.

  3. Collect word frequency lists in Javanese script.

  4. Preferably use the Linux platform for training.

  5. See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-just-a-few-layers

You can try this, as it will be faster than training from scratch; a rough command sketch is given below.

Please post links to Javanese script related resources below.

If there is a transliterator which converts Javanese in Latin script to Javanese script, it can be used to convert the files for lang jav as a starting point.
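
As a rough illustration of step 5 (replacing the top layer of an existing model), the commands look something like the sketch below. The base model, paths, layer index and net_spec are illustrative; the authoritative values are in the wiki page linked above.

```
# Extract the LSTM network from an existing best model (here the Latin-script
# jav model; all paths below are illustrative):
combine_tessdata -e tessdata_best/jav.traineddata jav.lstm

# Cut the network above a chosen layer and append a fresh output layer sized
# for the new Javanese-script unicharset (jav_java.traineddata and
# jav_java.unicharset are the starter files produced by tesstrain.sh; take the
# exact --append_index / --net_spec values from the wiki page linked above):
lstmtraining \
  --continue_from jav.lstm \
  --traineddata jav_java/jav_java.traineddata \
  --append_index 5 \
  --net_spec "[Lfx256 O1c$(head -n 1 jav_java/jav_java.unicharset)]" \
  --model_output output/jav_java \
  --train_listfile jav_java.training_files.txt \
  --max_iterations 3000
```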

Shreeshrii commented 6 years ago

> Do we have an example of the training "inputs" somewhere on the github projects?

See

https://github.com/tesseract-ocr/langdata/tree/master/jav

https://github.com/tesseract-ocr/langdata/blob/master/README.md

amitdo commented 6 years ago

Before training, he should try the best/fast jav.traineddata.

Shreeshrii commented 6 years ago

jav is Javanese language in Latin script.

zat kasebut lan kanthi Kategori:Tokoh ing user:OffsBlink para pedunung PL09Puryono| kaya désa
2006 90%; sisih wiwit dan papan wilayah Delengen 5 || ! Wétan, Cathetan € sawijining | saged
amarga Cathetan jaba saka Dominique jiwa. ingkang User:ZorroIII Indonesia 1] langkung NGC

He wants it in Javanese script.

The Javanese script, natively known as Aksara Jawa (ꦲꦏ꧀ꦱꦫꦗꦮ, aksara jawa) and Hanacaraka (ꦲꦤꦕꦫꦏ, hanacaraka), is an abugida developed by the Javanese people to write several Austronesian languages spoken in Indonesia, primarily the Javanese language and an early form of Javanese called Kawi, as well as Sanskrit, an Indo-Aryan language used as a sacred language throughout Asia. The Javanese script is a descendant of the Brahmi script and therefore has many similarities with the modern scripts of South India and Southeast Asia. The Javanese script, along with the Balinese script, is considered the most elaborate and ornate among Brahmic scripts of Southeast Asia.[1]

This might be similar to Thai/Khmer - could try using that to train from.

Shreeshrii commented 6 years ago

http://unicode.org/udhr/d/udhr_jav_java.html

Universal Declaration of Human Rights - Javanese (Javanese)

Shreeshrii commented 6 years ago

https://jv.wikipedia.org/wiki/Parembugan:Joko_Widodo

Most of Javanese wikipedia seems to be in Latin script.

Shreeshrii commented 6 years ago

https://r12a.github.io/scripts/javanese/

https://r12a.github.io/scripts/featurelist/

amitdo commented 6 years ago

Did you unpack jav from best/fast?

topherseance commented 6 years ago

Hello, thanks a lot for your help, I appreciate it. Thanks again for providing the links to Javanese script resources.

Sorry, I always thought that we needed images as training data, but that is not the case for Tesseract 4.0. :)

Another question: do we have to collect all 500,000 text lines before beginning the training? Can I, let's say, collect only 100 lines and then start the training? (I am also well aware that the result may not be good, e.g. overfitting.)

Shreeshrii commented 6 years ago

100 lines will work only for fine-tuning, but you can give it a try to get familiar with the training process.
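
For example, a minimal fine-tuning run on a small dataset might look like the sketch below. File names, paths and iteration counts are illustrative, and the .lstmf files listed in jav.training_files.txt are assumed to have been produced by tesstrain.sh.

```
# Continue training an existing model on a small amount of new line data.
# jav.lstm is assumed to have been extracted with `combine_tessdata -e`.
lstmtraining \
  --continue_from jav.lstm \
  --traineddata tessdata_best/jav.traineddata \
  --model_output output/jav_tune \
  --train_listfile jav.training_files.txt \
  --max_iterations 400

# Package the best checkpoint back into a usable traineddata file:
lstmtraining --stop_training \
  --continue_from output/jav_tune_checkpoint \
  --traineddata tessdata_best/jav.traineddata \
  --model_output jav_tuned.traineddata
```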

Shreeshrii commented 6 years ago

> Did you unpack jav from best/fast?

@amitdo I had only looked at langdata. I checked just now after your post; the unicharset in both is in Latin script only. See below for the tessdata_fast version:

94
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
J 5 0,255,0,255,0,0,0,0,0,0 Latin 73 0 3 J  # J [4a ]A
E 5 0,255,0,255,0,0,0,0,0,0 Latin 58 0 4 E  # E [45 ]A
N 5 0,255,0,255,0,0,0,0,0,0 Latin 61 0 5 N  # N [4e ]A
I 5 0,255,0,255,0,0,0,0,0,0 Latin 66 0 6 I  # I [49 ]A
F 5 0,255,0,255,0,0,0,0,0,0 Latin 84 0 7 F  # F [46 ]A
R 5 0,255,0,255,0,0,0,0,0,0 Latin 74 0 8 R  # R [52 ]A
M 5 0,255,0,255,0,0,0,0,0,0 Latin 63 0 9 M  # M [4d ]A
A 5 0,255,0,255,0,0,0,0,0,0 Latin 60 0 10 A # A [41 ]A
G 5 0,255,0,255,0,0,0,0,0,0 Latin 65 0 11 G # G [47 ]A
: 10 0,255,0,255,0,0,0,0,0,0 Common 12 6 12 :   # : [3a ]p
P 5 0,255,0,255,0,0,0,0,0,0 Latin 67 0 13 P # P [50 ]A
L 5 0,255,0,255,0,0,0,0,0,0 Latin 72 0 14 L # L [4c ]A
T 5 0,255,0,255,0,0,0,0,0,0 Latin 59 0 15 T # T [54 ]A
U 5 0,255,0,255,0,0,0,0,0,0 Latin 68 0 16 U # U [55 ]A
B 5 0,255,0,255,0,0,0,0,0,0 Latin 76 0 17 B # B [42 ]A
, 10 0,255,0,255,0,0,0,0,0,0 Common 18 6 18 ,   # , [2c ]p
K 5 0,255,0,255,0,0,0,0,0,0 Latin 75 0 19 K # K [4b ]A
H 5 0,255,0,255,0,0,0,0,0,0 Latin 62 0 20 H # H [48 ]A
D 5 0,255,0,255,0,0,0,0,0,0 Latin 71 0 21 D # D [44 ]A
S 5 0,255,0,255,0,0,0,0,0,0 Latin 64 0 22 S # S [53 ]A
# 10 0,255,0,255,0,0,0,0,0,0 Common 23 4 23 #   # # [23 ]p
Ê 5 0,255,0,255,0,0,0,0,0,0 Latin 78 0 24 Ê # Ê [ca ]A
- 10 0,255,0,255,0,0,0,0,0,0 Common 25 3 25 -   # - [2d ]p
. 10 0,255,0,255,0,0,0,0,0,0 Common 26 6 26 .   # . [2e ]p
Y 5 0,255,0,255,0,0,0,0,0,0 Latin 69 0 27 Y # Y [59 ]A
W 5 0,255,0,255,0,0,0,0,0,0 Latin 70 0 28 W # W [57 ]A
O 5 0,255,0,255,0,0,0,0,0,0 Latin 77 0 29 O # O [4f ]A
' 10 0,255,0,255,0,0,0,0,0,0 Common 30 10 30 '  # ' [27 ]p
8 8 0,255,0,255,0,0,0,0,0,0 Common 31 2 31 8    # 8 [38 ]0
! 10 0,255,0,255,0,0,0,0,0,0 Common 32 10 32 !  # ! [21 ]p
” 10 0,255,0,255,0,0,0,0,0,0 Common 33 10 33 "  # ” [201d ]p
É 5 0,255,0,255,0,0,0,0,0,0 Latin 79 0 34 É # É [c9 ]A
? 10 0,255,0,255,0,0,0,0,0,0 Common 35 10 35 ?  # ? [3f ]p
C 5 0,255,0,255,0,0,0,0,0,0 Latin 85 0 36 C # C [43 ]A
È 5 0,255,0,255,0,0,0,0,0,0 Latin 80 0 37 È # È [c8 ]A
2 8 0,255,0,255,0,0,0,0,0,0 Common 38 2 38 2    # 2 [32 ]0
; 10 0,255,0,255,0,0,0,0,0,0 Common 39 10 39 ;  # ; [3b ]p
/ 10 0,255,0,255,0,0,0,0,0,0 Common 40 6 40 /   # / [2f ]p
( 10 0,255,0,255,0,0,0,0,0,0 Common 41 10 43 (  # ( [28 ]p
" 10 0,255,0,255,0,0,0,0,0,0 Common 42 10 42 "  # " [22 ]p
) 10 0,255,0,255,0,0,0,0,0,0 Common 43 10 41 )  # ) [29 ]p
1 8 0,255,0,255,0,0,0,0,0,0 Common 44 2 44 1    # 1 [31 ]0
3 8 0,255,0,255,0,0,0,0,0,0 Common 45 2 45 3    # 3 [33 ]0
7 8 0,255,0,255,0,0,0,0,0,0 Common 46 2 46 7    # 7 [37 ]0
“ 10 0,255,0,255,0,0,0,0,0,0 Common 47 10 47 "  # “ [201c ]p
Z 5 0,255,0,255,0,0,0,0,0,0 Latin 87 0 48 Z # Z [5a ]A
[ 10 0,255,0,255,0,0,0,0,0,0 Common 49 10 50 [  # [ [5b ]p
] 10 0,255,0,255,0,0,0,0,0,0 Common 50 10 49 ]  # ] [5d ]p
| 0 0,255,0,255,0,0,0,0,0,0 Common 51 10 51 |   # | [7c ]
V 5 0,255,0,255,0,0,0,0,0,0 Latin 86 0 52 V # V [56 ]A
0 8 0,255,0,255,0,0,0,0,0,0 Common 53 2 53 0    # 0 [30 ]0
5 8 0,255,0,255,0,0,0,0,0,0 Common 54 2 54 5    # 5 [35 ]0
— 10 0,255,0,255,0,0,0,0,0,0 Common 55 10 55 -  # — [2014 ]p
_ 10 0,255,0,255,0,0,0,0,0,0 Common 56 10 56 _  # _ [5f ]p
€ 0 0,255,0,255,0,0,0,0,0,0 Common 57 4 57 €    # € [20ac ]
e 3 0,255,0,255,0,0,0,0,0,0 Latin 4 0 58 e  # e [65 ]a
t 3 0,255,0,255,0,0,0,0,0,0 Latin 15 0 59 t # t [74 ]a
a 3 0,255,0,255,0,0,0,0,0,0 Latin 10 0 60 a # a [61 ]a
n 3 0,255,0,255,0,0,0,0,0,0 Latin 5 0 61 n  # n [6e ]a
h 3 0,255,0,255,0,0,0,0,0,0 Latin 20 0 62 h # h [68 ]a
m 3 0,255,0,255,0,0,0,0,0,0 Latin 9 0 63 m  # m [6d ]a
s 3 0,255,0,255,0,0,0,0,0,0 Latin 22 0 64 s # s [73 ]a
g 3 0,255,0,255,0,0,0,0,0,0 Latin 11 0 65 g # g [67 ]a
i 3 0,255,0,255,0,0,0,0,0,0 Latin 6 0 66 i  # i [69 ]a
p 3 0,255,0,255,0,0,0,0,0,0 Latin 13 0 67 p # p [70 ]a
u 3 0,255,0,255,0,0,0,0,0,0 Latin 16 0 68 u # u [75 ]a
y 3 0,255,0,255,0,0,0,0,0,0 Latin 27 0 69 y # y [79 ]a
w 3 0,255,0,255,0,0,0,0,0,0 Latin 28 0 70 w # w [77 ]a
d 3 0,255,0,255,0,0,0,0,0,0 Latin 21 0 71 d # d [64 ]a
l 3 0,255,0,255,0,0,0,0,0,0 Latin 14 0 72 l # l [6c ]a
j 3 0,255,0,255,0,0,0,0,0,0 Latin 3 0 73 j  # j [6a ]a
r 3 0,255,0,255,0,0,0,0,0,0 Latin 8 0 74 r  # r [72 ]a
k 3 0,255,0,255,0,0,0,0,0,0 Latin 19 0 75 k # k [6b ]a
b 3 0,255,0,255,0,0,0,0,0,0 Latin 17 0 76 b # b [62 ]a
o 3 0,255,0,255,0,0,0,0,0,0 Latin 29 0 77 o # o [6f ]a
ê 3 0,255,0,255,0,0,0,0,0,0 Latin 24 0 78 ê # ê [ea ]a
é 3 0,255,0,255,0,0,0,0,0,0 Latin 34 0 79 é # é [e9 ]a
è 3 0,255,0,255,0,0,0,0,0,0 Latin 37 0 80 è # è [e8 ]a
4 8 0,255,0,255,0,0,0,0,0,0 Common 81 2 81 4    # 4 [34 ]0
6 8 0,255,0,255,0,0,0,0,0,0 Common 82 2 82 6    # 6 [36 ]0
9 8 0,255,0,255,0,0,0,0,0,0 Common 83 2 83 9    # 9 [39 ]0
f 3 0,255,0,255,0,0,0,0,0,0 Latin 7 0 84 f  # f [66 ]a
c 3 0,255,0,255,0,0,0,0,0,0 Latin 36 0 85 c # c [63 ]a
v 3 0,255,0,255,0,0,0,0,0,0 Latin 52 0 86 v # v [76 ]a
z 3 0,255,0,255,0,0,0,0,0,0 Latin 48 0 87 z # z [7a ]a
= 0 0,255,0,255,0,0,0,0,0,0 Common 88 10 88 =   # = [3d ]
< 0 0,255,0,255,0,0,0,0,0,0 Common 89 10 90 <   # < [3c ]
> 0 0,255,0,255,0,0,0,0,0,0 Common 90 10 89 >   # > [3e ]
@ 10 0,255,0,255,0,0,0,0,0,0 Common 91 10 91 @  # @ [40 ]p
$ 0 0,255,0,255,0,0,0,0,0,0 Common 92 4 92 $    # $ [24 ]
£ 0 0,255,0,255,0,0,0,0,0,0 Common 93 4 93 £    # £ [a3 ]
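
For reference, the list above was obtained by unpacking the traineddata with combine_tessdata; roughly like this sketch (the download path is illustrative):

```
# Unpack the components of the fast Javanese model:
combine_tessdata -u tessdata_fast/jav.traineddata jav.
# The LSTM unicharset is one of the unpacked components:
head -n 20 jav.lstm-unicharset
```
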
Shreeshrii commented 6 years ago

@topherseance Please see the attached zip file, which has a test training for Javanese including both Javanese and Latin script. It was only trained (by replacing a layer) up to about 7% accuracy on the small training data that I could gather.

Keep us updated on your progress with training.

jav-traineddatas.zip

robbyablaze commented 6 years ago

jav-traineddatas.zip

@Shreeshrii Hi, I am quite interested in this post. Could you give me the training data from this? I need to generate Javanese script training data compatible with Tesseract 3.04/3.05, as I want to use it on an Android device. I use tess-two, which is not yet compatible with Tesseract 4.

Shreeshrii commented 6 years ago

> generate Javanese script training data compatible with Tesseract 3.04/3.05

The requirements of training data for tesseract 3.0x are quite different from those for 4.0.0 LSTM training.

You can use jav-java text from UDHR or wikipedia as linked in posts above.
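
For comparison, the legacy 3.0x pipeline is box/tiff based and looks roughly like the sketch below. All file and font names are illustrative, and a real run also needs a font_properties file and usually wordlists/DAWGs.

```
# Legacy (3.0x) box/tiff training, heavily compressed:
text2image --text=jav_java.training_text --outputbase=jav_java.MyFont.exp0 \
  --font='My Javanese Font' --fonts_dir=./fonts
tesseract jav_java.MyFont.exp0.tif jav_java.MyFont.exp0 box.train
unicharset_extractor jav_java.MyFont.exp0.box
shapeclustering -F font_properties -U unicharset jav_java.MyFont.exp0.tr
mftraining -F font_properties -U unicharset -O jav_java.unicharset jav_java.MyFont.exp0.tr
cntraining jav_java.MyFont.exp0.tr
# rename the outputs with the language prefix and combine them:
mv inttemp jav_java.inttemp && mv pffmtable jav_java.pffmtable
mv normproto jav_java.normproto && mv shapetable jav_java.shapetable
combine_tessdata jav_java.
```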

topherseance commented 5 years ago

Hello, sorry for the hiatus; I had other tasks to do.

I have only found two Javanese fonts so far:

  1. Noto Sans Javanese (by Google)
  2. Tuladha Jejeg (by R.S. Wihananto)

I tried to create a starter traineddata for Noto Sans Javanese using the command below, and it works successfully:

topher@topher-ubuntu:~$ ~/tesseract/src/training/tesstrain.sh   --fonts_dir ~/tess-javanese/fonts   --lang jav   --linedata_only   --noextract_font_properties   --langdata_dir ~/tesseract/langdata   --tessdata_dir ~/tesseract/tessdata   --fontlist "Noto Sans Javanese"   --output_dir ~/tess-javanese/jav01-train

=== Starting training for language 'jav'
[Sen Jul 9 14:37:14 WIB 2018] /usr/local/bin/text2image --fonts_dir=/home/topher/tess-javanese/fonts --font=Noto Sans Javanese --outputbase=/tmp/font_tmp.l81FA3YVZ2/sample_text.txt --text=/tmp/font_tmp.l81FA3YVZ2/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.l81FA3YVZ2
Stripped 1 unrenderable words
Rendered page 0 to file /tmp/font_tmp.l81FA3YVZ2/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Noto Sans Javanese
[Sen Jul 9 14:37:16 WIB 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.l81FA3YVZ2 --fonts_dir=/home/topher/tess-javanese/fonts --strip_unrenderable_words --leading=32 --xsize 2560 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0 --max_pages=0 --font=Noto Sans Javanese --text=/home/topher/tesseract/langdata/jav/jav.training_text
Rendered page 0 to file /tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Sen Jul 9 14:37:17 WIB 2018] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/tmp.TYPQCfx2ed/jav/jav.unicharset --norm_mode 1 /tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0.box
Extracting unicharset from box file /tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0.box
Word started with a combiner:0xa9b8
Normalization failed for string 'ꦸ'
Word started with a combiner:0xa9bc
Word started with a combiner:0xa981
Normalization failed for string 'ꦼꦁ'
Word started with a combiner:0xa9b8
Normalization failed for string 'ꦸ'
Word started with a combiner:0xa983
Normalization failed for string 'ꦃ'
Word started with a combiner:0xa9c0
Normalization failed for string '꧀ꦠ'
Word started with a combiner:0xa9bc
Normalization failed for string 'ꦼ'
Word started with a combiner:0xa9c0
Normalization failed for string '꧀ꦲ'
Word started with a combiner:0xa9b6
Word started with a combiner:0xa981
Normalization failed for string 'ꦶꦁ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Word started with a combiner:0xa983
Normalization failed for string 'ꦃ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Wrote unicharset file /tmp/tmp.TYPQCfx2ed/jav/jav.unicharset
[Sen Jul 9 14:37:17 WIB 2018] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.TYPQCfx2ed/jav/jav.unicharset -O /tmp/tmp.TYPQCfx2ed/jav/jav.unicharset -X /tmp/tmp.TYPQCfx2ed/jav/jav.xheights --script_dir=/home/topher/tesseract/langdata
Loaded unicharset of size 16 from file /tmp/tmp.TYPQCfx2ed/jav/jav.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:/home/topher/tesseract/langdata/Javanese.unicharset
Warning: properties incomplete for index 3 = ꧋
Warning: properties incomplete for index 4 = ꦱ
Warning: properties incomplete for index 5 = ꦒ
Warning: properties incomplete for index 6 = ꦫ
Warning: properties incomplete for index 7 = ꦮ
Warning: properties incomplete for index 8 = ꦮꦺ
Warning: properties incomplete for index 9 = ꦤ
Warning: properties incomplete for index 10 = ꦏ
Warning: properties incomplete for index 11 = ꦥꦺ
Warning: properties incomplete for index 12 = ꦝ
Warning: properties incomplete for index 13 = ꦪ
Warning: properties incomplete for index 14 = ꦗ
Warning: properties incomplete for index 15 = ꧉
Writing unicharset to file /tmp/tmp.TYPQCfx2ed/jav/jav.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=/home/topher/tesseract/tessdata
[Sen Jul 9 14:37:17 WIB 2018] /usr/local/bin/tesseract /tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0.tif /tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.2-342-g12f4 with Leptonica
Page 1

=== Constructing LSTM training data ===
[Sen Jul 9 14:37:17 WIB 2018] /usr/local/bin/combine_lang_model --input_unicharset /tmp/tmp.TYPQCfx2ed/jav/jav.unicharset --script_dir /home/topher/tesseract/langdata --words /home/topher/tesseract/langdata/jav/jav.wordlist --numbers /home/topher/tesseract/langdata/jav/jav.numbers --puncs /home/topher/tesseract/langdata/jav/jav.punc --output_dir /home/topher/tess-javanese/jav01-train --lang jav
Failed to read data from: /home/topher/tesseract/langdata/jav/jav.wordlist
Failed to read data from: /home/topher/tesseract/langdata/jav/jav.punc
Failed to read data from: /home/topher/tesseract/langdata/jav/jav.numbers
Loaded unicharset of size 16 from file /tmp/tmp.TYPQCfx2ed/jav/jav.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:/home/topher/tesseract/langdata/Javanese.unicharset
Warning: properties incomplete for index 3 = ꧋
Warning: properties incomplete for index 4 = ꦱ
Warning: properties incomplete for index 5 = ꦒ
Warning: properties incomplete for index 6 = ꦫ
Warning: properties incomplete for index 7 = ꦮ
Warning: properties incomplete for index 8 = ꦮꦺ
Warning: properties incomplete for index 9 = ꦤ
Warning: properties incomplete for index 10 = ꦏ
Warning: properties incomplete for index 11 = ꦥꦺ
Warning: properties incomplete for index 12 = ꦝ
Warning: properties incomplete for index 13 = ꦪ
Warning: properties incomplete for index 14 = ꦗ
Warning: properties incomplete for index 15 = ꧉
Config file is optional, continuing...
Failed to read data from: /home/topher/tesseract/langdata/jav/jav.config
Null char=2
Moving /tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0.box to /home/topher/tess-javanese/jav01-train
Moving /tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0.tif to /home/topher/tess-javanese/jav01-train
Moving /tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0.lstmf to /home/topher/tess-javanese/jav01-train

Created starter traineddata for language 'jav'

Run lstmtraining to do the LSTM training for language 'jav'

But when I tried to do the same for the Tuladha Jejeg font, it showed this error:

topher@topher-ubuntu:~$ ~/tesseract/src/training/tesstrain.sh   --fonts_dir ~/tess-javanese/fonts   --lang jav   --linedata_only   --noextract_font_properties   --langdata_dir ~/tesseract/langdata   --tessdata_dir ~/tesseract/tessdata   --fontlist "Tuladha Jejeg"   --output_dir ~/tess-javanese/jav02-train

=== Starting training for language 'jav'
[Sen Jul 9 14:45:14 WIB 2018] /usr/local/bin/text2image --fonts_dir=/home/topher/tess-javanese/fonts --font=Tuladha Jejeg --outputbase=/tmp/font_tmp.uAxequREYg/sample_text.txt --text=/tmp/font_tmp.uAxequREYg/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.uAxequREYg
Rendered page 0 to file /tmp/font_tmp.uAxequREYg/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Tuladha Jejeg
[Sen Jul 9 14:45:16 WIB 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.uAxequREYg --fonts_dir=/home/topher/tess-javanese/fonts --strip_unrenderable_words --leading=32 --xsize 2560 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.k4Fb5CaR5k/jav/jav.Tuladha_Jejeg.exp0 --max_pages=0 --font=Tuladha Jejeg --text=/home/topher/tesseract/langdata/jav/jav.training_text
Rendered page 0 to file /tmp/tmp.k4Fb5CaR5k/jav/jav.Tuladha_Jejeg.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Sen Jul 9 14:45:17 WIB 2018] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/tmp.k4Fb5CaR5k/jav/jav.unicharset --norm_mode 1 /tmp/tmp.k4Fb5CaR5k/jav/jav.Tuladha_Jejeg.exp0.box
Extracting unicharset from box file /tmp/tmp.k4Fb5CaR5k/jav/jav.Tuladha_Jejeg.exp0.box
Word started with a combiner:0xa9bc
Word started with a combiner:0xa981
Normalization failed for string 'ꦼꦁ'
Word started with a combiner:0xa983
Normalization failed for string 'ꦃ'
Word started with a combiner:0xa9c0
Normalization failed for string '꧀ꦠ'
Word started with a combiner:0xa9bc
Normalization failed for string 'ꦼ'
Word started with a combiner:0xa9c0
Normalization failed for string '꧀ꦲ'
Word started with a combiner:0xa9b6
Word started with a combiner:0xa981
Normalization failed for string 'ꦶꦁ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Word started with a combiner:0xa983
Normalization failed for string 'ꦃ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Wrote unicharset file /tmp/tmp.k4Fb5CaR5k/jav/jav.unicharset
[Sen Jul 9 14:45:17 WIB 2018] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.k4Fb5CaR5k/jav/jav.unicharset -O /tmp/tmp.k4Fb5CaR5k/jav/jav.unicharset -X /tmp/tmp.k4Fb5CaR5k/jav/jav.xheights --script_dir=/home/topher/tesseract/langdata
Loaded unicharset of size 17 from file /tmp/tmp.k4Fb5CaR5k/jav/jav.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:/home/topher/tesseract/langdata/Javanese.unicharset
Warning: properties incomplete for index 3 = ꧋
Warning: properties incomplete for index 4 = ꦱꦸ
Warning: properties incomplete for index 5 = ꦒ
Warning: properties incomplete for index 6 = ꦫ
Warning: properties incomplete for index 7 = ꦮꦸ
Warning: properties incomplete for index 8 = ꦮꦺ
Warning: properties incomplete for index 9 = ꦤ
Warning: properties incomplete for index 10 = ꦮ
Warning: properties incomplete for index 11 = ꦏ
Warning: properties incomplete for index 12 = ꦥꦺ
Warning: properties incomplete for index 13 = ꦝ
Warning: properties incomplete for index 14 = ꦪ
Warning: properties incomplete for index 15 = ꦗ
Warning: properties incomplete for index 16 = ꧉
Writing unicharset to file /tmp/tmp.k4Fb5CaR5k/jav/jav.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=/home/topher/tesseract/tessdata
[Sen Jul 9 14:45:17 WIB 2018] /usr/local/bin/tesseract /tmp/tmp.k4Fb5CaR5k/jav/jav.Tuladha_Jejeg.exp0.tif /tmp/tmp.k4Fb5CaR5k/jav/jav.Tuladha_Jejeg.exp0 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.2-342-g12f4 with Leptonica
Page 1
Empty page!!
Empty page!!
ERROR: /tmp/tmp.k4Fb5CaR5k/jav/jav.Tuladha_Jejeg.exp0.lstmf does not exist or is not readable

The tesseract/langdata/jav directory contains only one file, jav.training_text; its contents are one line of Javanese text:

꧋ꦱꦸꦒꦼꦁꦫꦮꦸꦃꦮꦺꦤ꧀ꦠꦼꦤ꧀ꦲꦶꦁꦮꦶꦏꦶꦥꦺꦝꦶꦪꦃꦗꦮꦶ꧉

(taken from here)

I opened the /tmp/ folder, looked at /tmp/tmp.k4Fb5CaR5k/jav/jav.Tuladha_Jejeg.exp0.tif, and I think it is rendered correctly. Here's the file: jav.Tuladha_Jejeg.exp0.zip. Did I do everything right? Sorry if it was a rookie mistake.

One more piece of information: the Javanese script, per the Unicode standard, has glyph-combining letters (see Pasangan). Tuladha Jejeg uses SIL Graphite to do the combining, whereas Noto Sans Javanese uses OpenType ligatures and anchors. (I think OpenType has wider compatibility and support than SIL Graphite; for example, the Chrome browser doesn't support SIL Graphite, so Javanese script won't render correctly in that font.) Does this have anything to do with the error?

Thanks before

Shreeshrii commented 5 years ago

Text2image uses Pango for font rendering, and it is possible that Pango does not support SIL Graphite fonts. I also get errors for the Annapurna SIL Devanagari font and do not use it.
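
A quick way to sanity-check a font before running the whole pipeline is to ask text2image (i.e. Pango) directly; the font name and paths below are illustrative.

```
# Check which fonts Pango can see in the fonts directory:
text2image --fonts_dir ./fonts --list_available_fonts

# Render the training text with one font; watch for "Stripped N unrenderable
# words" warnings like those in the logs above, and inspect the output .tif:
text2image --fonts_dir ./fonts --font 'Tuladha Jejeg' \
  --text jav.training_text --outputbase /tmp/jav.render_check
```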

Shreeshrii commented 5 years ago

I think I had used a couple more fonts.

topherseance commented 5 years ago

I see... but then why does the resulting .tif image seem to be rendered correctly? It looks the same when you compare it with the image from the Wikipedia link I provided.

topherseance commented 5 years ago

We plan on using the OCR for old textbook scans written in Javanese script. So far, the Tuladha Jejeg font is the most similar to the ones found in old textbooks; Noto Sans Javanese looks a bit more 'modern'.

topherseance commented 5 years ago

Just tested with other text strings; some of them worked, some did not. Here's what we found:

  * ꦲꦤꦕꦫꦏ (simple phrase, no glyph-combining) --> success
  * ꦲꦤꦕꦫꦏꦮꦺꦤ꧀ꦠꦼ (uses glyph-combining: Pasangan) --> success
  * ꦲꦤꦕꦫꦏꦮꦺꦤ꧀ꦠꦼꦮꦸ (uses glyph-combining: Sandhangan) --> failure

Shreeshrii commented 5 years ago

see https://github.com/tesseract-ocr/tesseract/issues/1038

There may not be any existing Javanese script related rules. These will need to be added.

On Mon, Jul 9, 2018 at 6:50 PM Shree Devi Kumar shreeshrii@gmail.com wrote:

> Please look at the validation/normalization rules for Indic scripts in the code. Something there may be triggering these errors.

Shreeshrii commented 5 years ago

I ignored the errors and continued with training, using 5 fonts which seem to cover the Javanese code range.

Iteration 29986: ALIGNED TRUTH : ꦤ꧀ ꦄꦩꦺꦫꦶꦏ ꦒꦝꦃ ꦠꦁꦒꦺꦭ꧀ ꦏꦺꦕꦩꦠꦤ꧀ ꦥꦺꦏꦭꦺꦴꦔꦤ꧀ ꦧꦚ꧀ꦗꦸꦂ ꦭ
Iteration 29986: BEST OCR TEXT : ꦤ꧀ ꦄꦩꦺꦫꦶꦏ ꦒꦝꦃ ꦠꦁꦒꦺꦭ꧀ ꦏꦺꦕꦩꦠꦤ꧀ ꦥꦺꦏꦭꦺꦴꦔꦤ꧀ ꦧꦚ꧀ꦗꦂ ꦭ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Noto_Sans_Javanese.exp2.lstmf page 100 :
Mean rms=1.605%, delta=4.39%, train=13.896%(52.571%), skip ratio=0.4%
Iteration 29987: ALIGNED TRUTH : ꦶꦠꦸꦠ꧀ ꦏꦺꦱꦺꦤꦶꦪꦤ꧀ ꦱꦩ꧀ꦥꦸꦤ꧀ ꦗꦁꦏꦺꦥ꧀ ꦫꦺꦏꦺꦴꦂ ꦥꦶꦪ ꦲꦺꦩ꧀ꦥ꧀ꦭꦺꦴꦏ꧀ ꦲꦺꦩ꧀ꦧꦃꦏꦏꦸꦁ ꦲꦺꦩ꧀ꦧꦃꦥꦸꦠꦿ
Iteration 29987: BEST OCR TEXT : ꦱꦶꦠꦸꦠ꧀ ꦏꦺꦱꦺꦤꦶꦪꦤ꧀ ꦱꦩ꧀ꦥꦸꦤ꧀ ꦗꦁꦏꦺꦥ꧀ ꦫꦺꦏꦺꦂ ꦥꦶꦪ ꦲꦺꦩꦺꦏ꧀ ꦲꦺꦩ꧀ꦧꦃꦏꦏꦸꦁ ꦲꦺꦩ꧀ꦧꦃ ꦥꦸꦠꦶ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Tuladha_Jejeg.exp0.lstmf page 31 :
Mean rms=1.605%, delta=4.392%, train=13.901%(52.578%), skip ratio=0.4%
Iteration 29988: ALIGNED TRUTH : ꦤꦺꦴꦧꦺꦭ꧀ ꦱꦱ꧀ꦠꦿ ꦤꦺꦴꦧꦺꦭ꧀ ꦱ
Iteration 29988: BEST OCR TEXT : ꦏꦤꦺꦴꦧꦺꦭ꧀ ꦱꦱꦿ ꦤꦺꦴꦠꦺꦭ꧀ ꧈
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Tuladha_Jejeg.exp-1.lstmf page 499 :
Mean rms=1.605%, delta=4.392%, train=13.884%(52.578%), skip ratio=0.4%
Iteration 29989: ALIGNED TRUTH : ꦼꦱꦶꦲꦗꦶ ꦮꦼꦱꦶ ꦮꦼꦱ꧀ꦠ ꦮꦼꦱ꧀ꦥꦢ
Iteration 29989: BEST OCR TEXT : ꦼꦱꦶꦲꦗꦼ ꦮꦼꦱꦶ ꦮꦼꦱ꧀ꦠ ꦥꦼꦱ꧀ꦥꦢ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Tuladha_Jejeg.exp1.lstmf page 123 :
Mean rms=1.605%, delta=4.386%, train=13.875%(52.559%), skip ratio=0.4%
Iteration 29990: ALIGNED TRUTH : ꦱ
Iteration 29990: BEST OCR TEXT : ꦕꦱ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Tuladha_Jejeg.exp-2.lstmf page 362 :
Mean rms=1.606%, delta=4.395%, train=13.961%(52.617%), skip ratio=0.4%
Iteration 29991: ALIGNED TRUTH : ꦩꦪꦸꦫ ꦩꦫꦁ ꦩꦫꦏꦂꦩ ꦩꦫꦏꦠ
Iteration 29991: BEST OCR TEXT : ꦩꦥꦪꦸꦫ ꦩꦫꦁ ꦩꦫꦏꦂꦩ ꦩꦫꦏꦠ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Tuladha_Jejeg.exp2.lstmf page 685 (Perfect):
Mean rms=1.605%, delta=4.391%, train=13.966%(52.592%), skip ratio=0.4%
Iteration 29992: ALIGNED TRUTH : ꦩꦺꦤ꧀ꦠ꧀ ꦩꦫꦺꦠ꧀ ꦠꦻꦴꦤ꧀ ꦝꦺꦮꦺꦏꦺ ꦲꦺꦤ꧀ꦠꦸꦏ꧀ ꦮꦶꦒꦠꦶ ꦱꦔꦺꦠ꧀ ꦱꦠꦸꦁ ꦒꦭ
Iteration 29992: BEST OCR TEXT : ꦩꦺꦤ꧀ꦠ꧀ꦩꦫꦺꦠ꧀ ꦠꦻꦴꦤ꧀ ꦝꦺꦮꦺꦏꦺꦲꦺꦤ꧀ꦠꦸꦏ꧀ꦮꦶꦒꦠꦶꦱ ꦔꦺꦠ꧀ ꦱꦠꦸꦁ ꦒꦭ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Carakan_Anyar.exp0.lstmf page 604 :
Mean rms=1.605%, delta=4.391%, train=13.965%(52.612%), skip ratio=0.4%
Iteration 29993: ALIGNED TRUTH : ꦤ꧀ ꦱꦶꦗꦶ ꦏꦧꦸꦥꦠꦺꦤ꧀ ꦠꦥꦤꦸꦭꦶ ꦧꦒꦺꦪꦤ꧀ ꦱꦏꦶꦁ ꦆꦧꦸꦏꦸꦛ ꦏꦺꦕꦩ
Iteration 29993: BEST OCR TEXT : ꦤ꧀ ꦱꦶꦗꦶꦏꦧꦸꦥꦠꦺꦤ꧀ ꦠꦥꦤꦸꦭꦶꦧꦒꦺꦪꦤ꧀ ꦱꦏꦶꦁ ꦆꦧꦸꦤ꧀ꦛ ꦏꦺꦕꦩ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Carakan_Anyar.exp-1.lstmf page 162 :
Mean rms=1.605%, delta=4.39%, train=13.965%(52.591%), skip ratio=0.4%
Iteration 29994: ALIGNED TRUTH : ꦱꦮꦶꦱꦺ ꦏꦧꦸꦥꦠꦺꦤ꧀ ꦕꦶꦪꦚ꧀ꦗꦸꦂ ꦱꦏ ꦠꦻꦴꦤ꧀ ꦥꦿꦺꦴꦮ꦳ꦶꦤ꧀ꦱꦶ ꦭꦶꦩ꧀ꦧꦸꦂꦒ꧀ ꦢꦶꦮ ꦩꦺꦤꦺꦲꦶ
Iteration 29994: BEST OCR TEXT : ꦱꦮꦶꦱꦺꦏꦧꦸꦥꦠꦺꦤ꧀ ꦕꦶꦪꦚ꧀ꦗꦸꦂ ꦱꦏ ꦠꦻꦴꦤ꧀ ꦥꦿꦺꦴꦮ꦳ꦶꦤ꧀ꦱꦶ ꦭꦶꦩ꧀ ꦧꦂꦒ꧀ ꦢꦶꦮ ꦩꦺꦤꦺꦲꦶ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Carakan_Anyar.exp1.lstmf page 759 :
Mean rms=1.604%, delta=4.386%, train=13.948%(52.549%), skip ratio=0.4%
Iteration 29995: ALIGNED TRUTH : ꦝꦺ ꦝꦠꦺꦁ ꦏꦱꦸꦭ꧀ꦠꦤꦤ꧀ ꦢꦺꦩꦏ꧀ ꦏꦺꦕꦩꦠꦤ꧀ ꦏꦭꦶꦠꦶꦢꦸ ꦥꦂꦠꦻ ꦒꦺꦴꦭ꧀ꦏꦂ
Iteration 29995: BEST OCR TEXT : ꦝꦺꦝꦠꦺꦴꦁ ꦏꦱꦸꦭ꧀ꦠꦤꦤ꧀ ꦢꦺꦩꦏ꧀ ꦏꦺꦕꦩꦠꦤ꧀ ꦏꦭꦶꦠꦶꦢꦸꦥꦂ ꦠꦻ ꦒꦺꦴꦭ꧀ ꦏꦂ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Carakan_Anyar.exp-2.lstmf page 160 :
Mean rms=1.603%, delta=4.37%, train=13.868%(52.512%), skip ratio=0.4%
Iteration 29996: ALIGNED TRUTH : ., ꦭꦶꦩ꧀ꦥꦢ꧀ ꦭꦶꦩ꧀ꦥꦸꦁ ꦭꦶꦁꦒꦶꦃ ꦭꦶꦁꦒ ꦭꦶꦁꦱꦁ ꦭꦶꦁꦱꦶꦂ ꦭꦶꦁꦱꦼꦩ꧀
Iteration 29996: BEST OCR TEXT : . , ꦭꦶꦩ꧀ꦥꦢ꧀ ꦭꦶꦩ꧀ꦥꦸꦁ ꦭꦶꦁꦒꦶꦃ ꦭꦶꦁꦒ ꦭꦶꦁꦱꦁ ꦭꦶꦁꦱꦶꦂ ꦭꦶꦱꦼꦩ꧀
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Carakan_Anyar.exp2.lstmf page 183 :
Mean rms=1.603%, delta=4.372%, train=13.869%(52.514%), skip ratio=0.4%
Iteration 29997: ALIGNED TRUTH : ꦥꦿꦏꦫꦤ ꦥꦿꦏꦫ ꦥꦿꦏꦮꦶꦱ꧀ ꦥꦿꦏꦱꦶꦠ ꦥꦿꦏꦱ ꦥꦿꦏꦫ ꦥꦿꦏꦮꦶꦱ꧀ ꦥꦿꦏꦱꦶꦠ ꦥꦿꦏꦱ ꦮꦶꦢꦸꦫ
Iteration 29997: BEST OCR TEXT : ꦥꦿꦏꦫꦤ ꦥꦿꦏꦫ ꦥꦿꦏꦮꦶꦱ꧀ ꦥꦿꦏꦱꦶꦠ ꦥꦿꦏꦱ ꦥꦿꦏꦫ ꦥꦿꦏꦮꦶꦱ꧀ ꦥꦿꦏꦱꦶꦠ ꦥꦿꦏꦱ ꦮꦶꦢꦸꦫ
File ./jav_java-layer_train/jav_java.Carakan-Unicode.exp0.lstmf page 321 (Perfect):
Mean rms=1.603%, delta=4.372%, train=13.869%(52.514%), skip ratio=0.4%
Iteration 29998: ALIGNED TRUTH : ꦤꦭꦶꦏ ꦏꦸꦮꦶ ꦒꦩ꧀ꦧꦂ:ꦥ꦳꧀ꦭꦒ꧀ ꦲꦺꦴꦥ꦳꧀ ꦲꦶꦁꦏꦁ ꦏꦺꦢꦃ ꦢꦤ꧀ ꦱꦺ
Iteration 29998: BEST OCR TEXT : ꦤꦭꦶꦏ ꦏꦸꦮꦶꦒꦩ꧀ꦧꦂꦥ꦳꧀ ꦭꦒ꧀ ꦲꦺꦴꦥ꦳꧀ ꦲꦶꦁꦏꦁ ꦏꦺꦢꦃ ꦢꦤ꧀ ꦱꦺ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Carakan-Unicode.exp-1.lstmf page 3 :
Mean rms=1.602%, delta=4.368%, train=13.857%(52.475%), skip ratio=0.4%
Iteration 29999: ALIGNED TRUTH : ꦲꦱꦱ꧀ꦠ ꦲꦱꦶꦂ ꦲꦱꦶꦃ ꦲꦱꦶꦤ꧀ ꦲꦱꦶꦫꦤ꧀ ꦲꦱꦭ꧀ ꦲꦸꦱꦸꦭ꧀ - ꦲꦱꦭ꧀ ꦲꦱꦱ꧀ꦠ ꦲꦱꦶꦂ ꦲꦱꦶ
Iteration 29999: BEST OCR TEXT : ꦲꦱꦱ꧀ꦠ ꦲꦱꦶꦂ ꦲꦱꦶꦃ ꦲꦱꦶꦤꦏ꧀ ꦲꦱꦶꦫꦤ꧀ ꦲꦱꦭ꧀ꦲꦸꦱꦸꦁꦭ꧀ - ꦲꦱꦭ꧀ ꦲꦱꦱ꧀ꦠ ꦲꦱꦶꦂ ꦲꦱꦶ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Carakan-Unicode.exp1.lstmf page 71 :
Mean rms=1.602%, delta=4.368%, train=13.852%(52.464%), skip ratio=0.4%
At iteration 28598/30000/30058, Mean rms=1.602%, delta=4.368%, char train=13.852%, word train=52.464%, skip ratio=0.4%,  New worst char error = 13.852 wrote checkpoint.

Finished! Error rate = 13.384

topherseance commented 5 years ago

Can you please share the commands and steps you did for the above training?

I still can't get the training to work successfully. I used the "training from scratch" method. Again, sorry if it is a newbie mistake.
Also, I couldn't find a .unicharset file in langdata/jav; do I need to create one? The resulting log contains many occurrences of this:

Encoding of string failed! Failure bytes: ffffffea ffffffa6 ffffff83 ffffffea ffffffa6 ffffffa9 ffffffea ffffffa6 ffffffb2 ffffffea ffffffa6 ffffffb8 ffffffea ffffffa7 ffffff88 20 ffffffea ffffffa6 ffffffb2 ffffffea ffffffa6 ffffffb6 ffffffea ffffffa6 ffffffaa ffffffea ffffffa6 ffffffba ffffffea ffffffa6 ffffffb4 ffffffea ffffffa6 ffffffb2 ffffffea ffffffa6 ffffff8f ffffffea ffffffa7 ffffff80 ffffffea ffffffa6 ffffffa9 ffffffea ffffffa6 ffffffb8 ffffffea ffffffa7 ffffff89 20 ffffffea ffffffa6 ffffff8f ffffffea ffffffa6 ffffffbc ffffffea ffffffa6 ffffffa4 ffffffea ffffffa6 ffffffad ffffffea ffffffa6 ffffffb6 ffffffea ffffffa6 ffffffad ffffffea ffffffa6 ffffffa4 ffffffea ffffffa7 ffffff80 ffffffea ffffffa6 ffffffa5 ffffffea ffffffa6 ffffffb2 ffffffea ffffffa6 ffffffa9 ffffffea ffffffa6 ffffffa4 ffffffea ffffffa6 ffffffb2 ffffffea ffffffa6 ffffff8f ffffffea ffffffa7 ffffff80 ffffffea ffffffa6 ffffffb2 ffffffea ffffffa6 ffffff8f ffffffea ffffffa7 ffffff80 ffffffea ffffffa6 ffffff8f ffffffea ffffffa6 ffffffb1 ffffffea ffffffa6 ffffffbc ffffffea ffffffa6 ffffffa7 ffffffea ffffffa6 ffffffb8 ffffffea ffffffa6 ffffffa0 ffffffea ffffffa7 ffffff80 ffffffea ffffffa7 ffffff88
Can't encode transcription: 'ꦏꦧꦺꦃꦩꦲꦸ꧈ ꦲꦶꦪꦺꦴꦲꦏ꧀ꦩꦸ꧉ ꦏꦼꦤꦭꦶꦭꦤ꧀ꦥꦲꦩꦤꦲꦏ꧀ꦲꦏ꧀ꦏꦱꦼꦧꦸꦠ꧀꧈' in language ''

I did run unicharset_extractor with a .txt file containing Javanese text. Here's the resulting unicharset file:

jav.unicharset.txt

Each line contains 0,255,0,255,0,0,0,0,0,0; I guess it is some sort of coordinates. Is that the correct value, or should I just use it anyway? The unicharset file you pasted earlier in this thread also contains 0,255,0,255,0,0,0,0,0,0 on each line.
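
For reference, those ten comma-separated numbers are the per-character glyph-metric range fields of the unicharset format, and 0,255,0,255,0,0,0,0,0,0 is just the default placeholder before any metrics are measured; they are not values you set by hand. A sketch of regenerating the unicharset and filling in the character properties (paths are illustrative and assume a langdata checkout):

```
# Build a unicharset from plain UTF-8 training text, then fill in the
# script/character properties from the langdata script unicharsets:
unicharset_extractor --output_unicharset jav_java.unicharset \
  --norm_mode 1 jav_java.training_text
set_unicharset_properties -U jav_java.unicharset -O jav_java.unicharset \
  --script_dir ~/tesseract/langdata
```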

topherseance commented 5 years ago

Found another font: Prada (https://sites.google.com/site/fontsundaprada/unduh-font-sunda-prada/prada.ttf)

Shreeshrii commented 5 years ago

The start of the Javanese script Unicode range may need to be added to https://github.com/tesseract-ocr/tesseract/blob/master/src/training/validator.h

amitdo commented 5 years ago

> Encoding of string failed! Failure bytes: ffffffea ffffffa6 ffffff83 ffffffea ffffffa6 ffffffa9 ffffffea ffffffa6 ffffffb2 ffffffea ffffffa6 ffffffb8 ffffffea ffffffa7 ffffff88

The text is clearly not encoded in utf-8.

Shreeshrii commented 5 years ago

> Can you please share the commands and steps you did for the above training?

Please see https://github.com/Shreeshrii/tessdata_jav_java

topherseance commented 5 years ago

I collected a few Javanese aksara texts here; it probably has several thousand text lines: https://github.com/topherseance/bible_javanese_aksara

topherseance commented 5 years ago

@Shreeshrii when you ran your scripts (layertrain.sh or plustrain.sh), did you get the "Encoding of string failed" error?

I ran the script and still got this:

File /tmp/tmp.fAmoYPWBIL/jav_java/jav_java.Carakan_Anyar.exp1.lstmf page 569 :
Mean rms=5.024%, delta=42.402%, train=100.11%(100%), skip ratio=61.7%
Encoding of string failed! Failure bytes: ffffffe2 ffffff80 ffffff8b ffffffea ffffffa6 ffffffb1 ffffffea ffffffa6 ffffffb6 ffffffea ffffffa6 ffffff81 20 ffffffea ffffffa6 ffffffa2 ffffffea ffffffa6 ffffffb6 ffffffea ffffffa6 ffffffaa ffffffea ffffffa6 ffffffaa ffffffea ffffffa6 ffffffb2 ffffffea ffffffa6 ffffffb6 20 ffffffea ffffffa6 ffffffa2 ffffffea ffffffa6 ffffffb8 ffffffea ffffffa6 ffffffa9
Can't encode transcription: 'ꦮꦺꦠꦤ꧀​ꦱꦶꦁ ꦢꦶꦪꦪꦲꦶ ꦢꦸꦩ' in language ''
Iteration 1171: ALIGNED TRUTH : ꦩꦭꦁꦲꦠꦺꦤꦶ ꦧꦸꦩꦶ ꦧꦸꦩ꧀ꦥꦼꦠ꧀
Iteration 1171: BEST OCR TEXT : 
File /tmp/tmp.fAmoYPWBIL/jav_java/jav_java.Tuladha_Jejeg.exp0.lstmf page 480 :
Mean rms=5.024%, delta=42.397%, train=100.111%(100%), skip ratio=61.7%
Iteration 1172: ALIGNED TRUTH : ꦄꦢꦩ꧀ ꦩꦭꦶꦏ꧀ ꦮ꦳ꦶꦢꦺꦪꦺꦴ ꦒ
Iteration 1172: BEST OCR TEXT : 
File /tmp/tmp.fAmoYPWBIL/jav_java/jav_java.Carakan_Anyar.exp-1.lstmf page 40 :
Mean rms=5.023%, delta=42.37%, train=100.113%(100%), skip ratio=61.7%
Iteration 1173: ALIGNED TRUTH : ꦏꦤ꧀ꦕ꧀ꦂꦶꦠ꧀ ꦏꦤ꧀ꦛꦶꦁ
Iteration 1173: BEST OCR TEXT : 

I checked the encoding of jav.training_text; it seems to be encoded in UTF-8:

topher@topher-ubuntu:~/tesseract/langdata/jav$ file -i jav.training_text
jav.training_text: text/plain; charset=utf-8

Shreeshrii commented 5 years ago

Are you getting the error on all lines of training text or just some lines?

I have had the error before but not with the current set of files.

What is your locale?

topherseance commented 5 years ago

Just some lines, I guess. My locale is EN.

Shreeshrii commented 5 years ago

There is probably some invisible code or character that is not in the unicharset. You can try to identify it from the text and the provided codes. If it affects only a few lines, you can ignore it.
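
For example, the "Encoding of string failed" dump above begins with the bytes e2 80 8b, which is the UTF-8 encoding of U+200B (zero width space). A sketch for locating and stripping such invisible characters, assuming GNU grep and sed (file names illustrative):

```
# Find zero-width spaces/joiners and BOMs in the training text (GNU grep):
grep -nP '\x{200B}|\x{200C}|\x{200D}|\x{FEFF}' jav_java.training_text
# Strip zero-width spaces into a cleaned copy (GNU sed, byte-level escapes):
sed 's/\xe2\x80\x8b//g' jav_java.training_text > jav_java.training_text.clean
```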

I have tried more training but am still not getting much better results; the error rate is around 7% on the training set.

Shreeshrii commented 5 years ago

My locale is en_us.utf8

That might make some difference in the display of the codes.

Shreeshrii commented 5 years ago

OK, I think I have found the reason for the error. It is related to the text not passing the 'normalization' rules as set up for the script.

For Javanese, I copied rules from existing languages but these need to be verified and corrected. https://github.com/tesseract-ocr/tesseract/blob/master/src/training/validate_javanese.cpp
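
Since the "Normalization failed" messages in the unicharset_extractor logs earlier in this thread come from this same validation code, a single suspect line can be checked in isolation. A sketch (the line number and file names are illustrative):

```
# Check one suspect line of training text in isolation; unicharset_extractor
# runs the same script validation and prints the same "Normalization failed"
# messages seen above:
sed -n '42p' jav_java.training_text > /tmp/suspect.txt
unicharset_extractor --output_unicharset /tmp/suspect.unicharset \
  --norm_mode 1 /tmp/suspect.txt
```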

Shreeshrii commented 5 years ago

I am getting the following kind of errors. Please check whether the Javanese text is valid.

Word started with a combiner:0xa9ba
Word started with a combiner:0xa9b4
Normalization failed for string 'ꦺꦴ'
Word started with a combiner:0xa9c0
Normalization failed for string '꧀ꦲꦶꦁ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string '꧇ꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦩ꧀ꦥꦸꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦏ꧀ꦢꦴꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string '꧇ꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦲꦶꦁꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦗꦿꦴꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦧꦼꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦤ꧀ꦱꦴꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦮꦶꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦤ꧀ꦏꦁꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦒꦴꦺ'
Invalid start of grapheme sequence:M=0xa9ba
Normalization failed for string 'ꦠ꧀ꦤꦶꦺ'
Invalid start of grapheme sequence:M=0xa9ba

Shreeshrii commented 5 years ago

@topherseance

The LSTM training langdata for Javanese in Latin script is now available at https://github.com/tesseract-ocr/langdata_lstm/tree/master/jav

You can convert it to aksara jawa to use for training.

topherseance commented 5 years ago

Done converting this file to aksara jawa:
https://github.com/tesseract-ocr/langdata_lstm/blob/master/jav/jav.training_text
Result:
https://github.com/topherseance/javanese-aksara-training-text

But what about the other files, for example .numbers and .wordlist?
Is the .numbers file correct? It seems to contain random letters...

Shreeshrii commented 5 years ago

Thanks.

Please convert the wordlist also.

The numbers file should actually just show patterns where numbers are found. Please look at the file for English as a sample.

Shreeshrii commented 5 years ago

The conversion does not retain any spaces between words in the lines. It seems that Javanese script does not require the spaces, but it may help training to have the words in a sentence separated by spaces.

topherseance commented 5 years ago

Updated my repo to include the conversion result with whitespace:
https://github.com/topherseance/javanese-aksara-training-text
https://github.com/topherseance/javanese-aksara-training-text/blob/master/with-whitespace-combined.txt

Shreeshrii commented 5 years ago

Thank you.

gindrawan commented 4 years ago

Hi, first of all I'm sorry, as this is perhaps a different topic, but I think it is quite related. I've opened an issue at https://github.com/tesseract-ocr/langdata/issues/152 (Balinese script OCR) but was still confused (newbie syndrome) until I finally landed here.

I'm ready to collect training text but am still on hold because, as with the Javanese fonts, the Balinese script has the Bali Simbar Dwijendra font (see the posted issue), which is the most similar to the ancient script but not yet tested for training (I'm afraid of the same incompatibility issue as Tuladha Jejeg; I will check soon). On the other hand, Balinese script also has Noto Sans/Serif Balinese from Google.

Also, I've downloaded https://github.com/Shreeshrii/tessdata_jav_java, and its README.md says "Source code changes will be needed in tesseract... "

Could you direct me on how to use all the material here, since the Javanese script has a big influence on the Balinese script? Geographically, Bali and Java are also neighbors.

Thank you very much in advance for your kind attention.

Shreeshrii commented 4 years ago

I had done aksara jawa training and created two traineddata files; see the links given in https://github.com/Shreeshrii/tessdata_jav_java/blob/master/README.md. But I am not sure how accurate those are, or whether @topherseance did further training on them.

The changes to tesseract codebase were made via:

https://github.com/tesseract-ocr/tesseract/commit/0eb7be1cd1707931abd77903793bf966a6640d58#diff-eaafd22a79065f5b8d28318d482e650d

https://github.com/tesseract-ocr/tesseract/commit/7957288fd5502551b6c7f073c5f4ecd1f0b11dd8#diff-eaafd22a79065f5b8d28318d482e650d

https://github.com/tesseract-ocr/tesseract/commit/b34cf9d424e88cd09aaa193697127c90ff76e0ce#diff-eaafd22a79065f5b8d28318d482e650d

gindrawan commented 4 years ago

Thanks for the quick response @Shreeshrii

Here is the updated situation:

  1. In the attachment, we have 2 fonts with Balinese Unicode support, namely Vimala (most similar to the non-Unicode Bali Simbar Dwijendra) and Noto Sans Balinese (just as Javanese has Noto Sans Javanese).
  2. I want to use https://github.com/Shreeshrii/tessdata_jav_java as a base for training with my Balinese training text. See the attachment for the Balinese version of Article 1 of the Universal Declaration of Human Rights (https://en.wikipedia.org/wiki/Balinese_script). And about the three-letter code for that text, I don't know: jav for Javanese, bal for Balinese?

The question is, how do I do that? I have tried for several hours to learn and work out a strategy, but I am still far away...

bal.training_text.txt (https://github.com/tesseract-ocr/langdata/files/4367222/bal.training_text.txt)

balinese-unicode.zip (https://github.com/tesseract-ocr/langdata/files/4367183/balinese-unicode.zip)

Shreeshrii commented 4 years ago

UDHR is a small text. You will need larger text for training.

LSTM training takes time, days and weeks.

gindrawan commented 4 years ago

> UDHR is a small text. You will need larger text for training. LSTM training takes time, days and weeks.

Yes, I know that... I want to start with a small training text first and incrementally add more later (if possible) while gaining more understanding of the training process. I already have a larger training text for Noto Sans Balinese (up to 30 thousand words, with the possibility of doubling it for Vimala). Most likely the number will continue to grow, since there are other sources that haven't been processed yet. I don't know if that number is enough...

Shreeshrii commented 4 years ago

Language and script codes follow the assigned names as per standards bodies.

The Balinese language three-letter code is ban; the Balinese script code is bali. The script can be used for a couple of other languages also.

Shreeshrii commented 4 years ago

I suggest moving the discussion to issue #152 https://github.com/tesseract-ocr/langdata/issues/152 (Balinese script OCR).

gindrawan commented 4 years ago

Ok, thanks. I'll post the update at https://github.com/tesseract-ocr/langdata/issues/152.

bennylin commented 3 years ago

@topherseance: if you're still looking for the Javanese OCR, a team in UKDW is working on it.