Can I add a new character to the unicharset file if it does not appear in the langdata folder?

hoangtocdo90 commented 7 years ago

Hi guys! I'm trying to train Tesseract for Japanese. Japanese has several types of characters. My case is about the half-width and full-width forms in the katakana table.

Half-width katakana example: ｱｲｳｴｵｶｷｸｹｺ
Full-width katakana example: アイウエオ　カキクケコ

They look very similar, a bit like uppercase and lowercase, but they are different characters. When the input is half-width katakana, Tesseract either can't recognize it or sometimes outputs full-width katakana instead.
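To make the difference concrete, here is a minimal Python sketch that prints the code points behind the two example strings and shows that Unicode NFKC normalization folds the half-width forms into the full-width ones. It only illustrates how the characters relate in Unicode. Whether any step of the Tesseract training pipeline applies such a normalization is an assumption that this thread does not confirm.

```python
import unicodedata

# Example strings from the post (the ideographic space is left out).
halfwidth = "ｱｲｳｴｵｶｷｸｹｺ"
fullwidth = "アイウエオカキクケコ"

# Each half-width character has its own code point in the
# Halfwidth and Fullwidth Forms block, distinct from the usual katakana.
for h, f in zip(halfwidth, fullwidth):
    print(f"U+{ord(h):04X} {h}  ->  U+{ord(f):04X} {f}")

# NFKC (compatibility) normalization maps half-width katakana
# to the ordinary full-width katakana.
folded = unicodedata.normalize("NFKC", halfwidth)
print(folded == fullwidth)

# A half-width base character followed by a half-width voiced sound mark
# collapses into a single precomposed full-width character.
voiced = unicodedata.normalize("NFKC", "ｶﾞ")
print(voiced, len(voiced))
```

If text is normalized this way somewhere before it reaches the recognizer or the unicharset, half-width input would come out as full-width katakana, which matches the symptom described above. That is a possible explanation, not a confirmed one.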

I tried using text2image to create the image and box files and then ran training with lstm.train, but I have a problem with the unicharset! I ran:

set_unicharset_properties -U unicharset -O unicharset -X jpn.xheights --script_dir=./langdata

I checked langdata/Katakana.unicharset, and it doesn't contain any half-width katakana symbols. Because of this, I can't make a unicharset file with all the fields set to the right values, like in this example.
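A check like the one described for langdata/Katakana.unicharset can be scripted. The sketch below assumes that each unicharset line starts with the character itself followed by a space, and it treats the code point range U+FF65 to U+FF9F as half-width katakana. The function names are made up for illustration and are not part of Tesseract.

```python
def is_halfwidth_katakana(ch: str) -> bool:
    """True if ch lies in the half-width katakana range U+FF65..U+FF9F."""
    return 0xFF65 <= ord(ch) <= 0xFF9F


def halfwidth_entries(lines):
    """Return the first field of every line that contains half-width katakana.

    Assumption: the first whitespace-separated field of a unicharset
    line is the character (or character sequence) itself.
    """
    found = []
    for line in lines:
        fields = line.split()
        if fields and any(is_halfwidth_katakana(c) for c in fields[0]):
            found.append(fields[0])
    return found


# A few lines copied from the unicharset dump in this post.
sample = [
    "414",
    "NULL 0 NULL 0",
    "ｱ 1 0,255,0,255,0,0,0,0,0,0 NULL 3 0 0 # ｱ [ff71 ]x",
    "ｳ 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # ｳ [ff73 ]x",
    "ヴ 1 0,255,0,255,0,0,0,0,0,0 NULL 28 0 0 # ヴ [30f4 ]x",
]
print(halfwidth_entries(sample))

# To check a real file, read it as UTF-8 and pass its lines, for example:
# with open("langdata/Katakana.unicharset", encoding="utf-8") as f:
#     print(halfwidth_entries(f))
```

An empty result for the langdata file would confirm that it has no half-width katakana entries.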

This is the unicharset file I got from running this command:

unicharset_extractor jpn.msgothic.exp18.box jpn.msgothic.exp32.box jpn.msgothic.exp48.box jpn.msgothicb.exp18.box

414
NULL 0 NULL 0
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Joined [4a 6f 69 6e 65 64 ]
|Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Broken
ｱ 1 0,255,0,255,0,0,0,0,0,0 NULL 3 0 0 # ｱ [ff71 ]x
ｳ 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # ｳ [ff73 ]x
ﾄ 1 0,255,0,255,0,0,0,0,0,0 NULL 5 0 0 # ﾄ [ff84 ]x
ﾌ 1 0,255,0,255,0,0,0,0,0,0 NULL 6 0 0 # ﾌ [ff8c ]x
ﾟ 1 0,255,0,255,0,0,0,0,0,0 NULL 7 0 0 # ﾟ [ff9f ]x
ｯ 1 0,255,0,255,0,0,0,0,0,0 NULL 8 0 0 # ｯ [ff6f ]x
ﾏ 1 0,255,0,255,0,0,0,0,0,0 NULL 9 0 0 # ﾏ [ff8f ]x
ﾙ 1 0,255,0,255,0,0,0,0,0,0 NULL 10 0 0 # ﾙ [ff99 ]x
ﾊ 1 0,255,0,255,0,0,0,0,0,0 NULL 11 0 0 # ﾊ [ff8a ]x
ﾁ 1 0,255,0,255,0,0,0,0,0,0 NULL 12 0 0 # ﾁ [ff81 ]x
ｼ 1 0,255,0,255,0,0,0,0,0,0 NULL 13 0 0 # ｼ [ff7c ]x
ｶ 1 0,255,0,255,0,0,0,0,0,0 NULL 14 0 0 # ｶ [ff76 ]x
ｵ 1 0,255,0,255,0,0,0,0,0,0 NULL 15 0 0 # ｵ [ff75 ]x
ｴ 1 0,255,0,255,0,0,0,0,0,0 NULL 16 0 0 # ｴ [ff74 ]x
ﾗ 1 0,255,0,255,0,0,0,0,0,0 NULL 17 0 0 # ﾗ [ff97 ]x
ﾝ 1 0,255,0,255,0,0,0,0,0,0 NULL 18 0 0 # ﾝ [ff9d ]x
ﾋ 1 0,255,0,255,0,0,0,0,0,0 NULL 19 0 0 # ﾋ [ff8b ]x
ﾞ 1 0,255,0,255,0,0,0,0,0,0 NULL 20 0 0 # ﾞ [ff9e ]x
ﾕ 1 0,255,0,255,0,0,0,0,0,0 NULL 21 0 0 # ﾕ [ff95 ]x
ｰ 1 0,255,0,255,0,0,0,0,0,0 NULL 22 0 0 # ｰ [ff70 ]x
ｻ 1 0,255,0,255,0,0,0,0,0,0 NULL 23 0 0 # ｻ [ff7b ]x
ｫ 1 0,255,0,255,0,0,0,0,0,0 NULL 24 0 0 # ｫ [ff6b ]x
ｽ 1 0,255,0,255,0,0,0,0,0,0 NULL 25 0 0 # ｽ [ff7d ]x
ﾃ 1 0,255,0,255,0,0,0,0,0,0 NULL 26 0 0 # ﾃ [ff83 ]x
ｨ 1 0,255,0,255,0,0,0,0,0,0 NULL 27 0 0 # ｨ [ff68 ]x
ヴ 1 0,255,0,255,0,0,0,0,0,0 NULL 28 0 0 # ヴ [30f4 ]x
ｮ 1 0,255,0,255,0,0,0,0,0,0 NULL 29 0 0 # ｮ [ff6e ]x
ｪ 1 0,255,0,255,0,0,0,0,0,0 NULL 30 0 0 # ｪ [ff6a ]x
ﾉ 1 0,255,0,255,0,0,0,0,0,0 NULL 31 0 0 # ﾉ [ff89 ]x
ﾎ 1 0,255,0,255,0,0,0,0,0,0 NULL 32 0 0 # ﾎ [ff8e ]x
ﾔ 1 0,255,0,255,0,0,0,0,0,0 NULL 33 0 0 # ﾔ [ff94 ]x
ﾘ 1 0,255,0,255,0,0,0,0,0,0 NULL 35 0 0 # ﾘ [ff98 ]x
ﾈ 1 0,255,0,255,0,0,0,0,0,0 NULL 36 0 0 # ﾈ [ff88 ]x
ｲ 1 0,255,0,255,0,0,0,0,0,0 NULL 37 0 0 # ｲ [ff72 ]x
ﾍ 1 0,255,0,255,0,0,0,0,0,0 NULL 38 0 0 # ﾍ [ff8d ]x
ｸ 1 0,255,0,255,0,0,0,0,0,0 NULL 39 0 0 # ｸ [ff78 ]x
ﾀ 1 0,255,0,255,0,0,0,0,0,0 NULL 40 0 0 # ﾀ [ff80 ]x
ﾆ 1 0,255,0,255,0,0,0,0,0,0 NULL 41 0 0 # ﾆ [ff86 ]x
ｹ 1 0,255,0,255,0,0,0,0,0,0 NULL 42 0 0 # ｹ [ff79 ]x
ｺ 1 0,255,0,255,0,0,0,0,0,0 NULL 43 0 0 # ｺ [ff7a ]x
ﾅ 1 0,255,0,255,0,0,0,0,0,0 NULL 44 0 0 # ﾅ [ff85 ]x
ﾛ 1 0,255,0,255,0,0,0,0,0,0 NULL 45 0 0 # ﾛ [ff9b ]x
ﾒ 1 0,255,0,255,0,0,0,0,0,0 NULL 46 0 0 # ﾒ [ff92 ]x
ｿ 1 0,255,0,255,0,0,0,0,0,0 NULL 47 0 0 # ｿ [ff7f ]x
ﾐ 1 0,255,0,255,0,0,0,0,0,0 NULL 48 0 0 # ﾐ [ff90 ]x
ｾ 1 0,255,0,255,0,0,0,0,0,0 NULL 49 0 0 # ｾ [ff7e ]x
ｷ 1 0,255,0,255,0,0,0,0,0,0 NULL 50 0 0 # ｷ [ff77 ]x
ﾜ 1 0,255,0,255,0,0,0,0,0,0 NULL 51 0 0 # ﾜ [ff9c ]x
ﾚ 1 0,255,0,255,0,0,0,0,0,0 NULL 52 0 0 # ﾚ [ff9a ]x
ｬ 1 0,255,0,255,0,0,0,0,0,0 NULL 53 0 0 # ｬ [ff6c ]x
ｭ 1 0,255,0,255,0,0,0,0,0,0 NULL 54 0 0 # ｭ [ff6d ]x
ﾓ 1 0,255,0,255,0,0,0,0,0,0 NULL 60 0 0 # ﾓ [ff93 ]x
ﾑ 1 0,255,0,255,0,0,0,0,0,0 NULL 61 0 0 # ﾑ [ff91 ]x
ｧ 1 0,255,0,255,0,0,0,0,0,0 NULL 62 0 0 # ｧ [ff67 ]x
ｩ 1 0,255,0,255,0,0,0,0,0,0 NULL 63 0 0 # ｩ [ff69 ]x
ﾂ 1 0,255,0,255,0,0,0,0,0,0 NULL 64 0 0 # ﾂ [ff82 ]x
ﾖ 1 0,255,0,255,0,0,0,0,0,0 NULL 65 0 0 # ﾖ [ff96 ]x
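Every entry in the dump carries the same value, 0,255,0,255,0,0,0,0,0,0, in its third field and NULL as its script. A small sketch like the following could list such entries automatically. The field order (character, properties, glyph metrics, script) is inferred from the dump itself, and treating that metrics value as "not yet set" is an assumption. The names below are hypothetical helpers, not Tesseract APIs.

```python
from typing import NamedTuple, Optional

# Assumption: this value means the glyph metrics were never filled in.
PLACEHOLDER_METRICS = "0,255,0,255,0,0,0,0,0,0"


class Entry(NamedTuple):
    unichar: str
    properties: str
    metrics: str
    script: str


def parse_entry(line: str) -> Optional[Entry]:
    """Parse one unicharset line, or return None if it has no metrics field."""
    fields = line.split()
    # Skips the count line ("414") and the special "NULL 0 NULL 0" entry.
    if len(fields) < 4 or "," not in fields[2]:
        return None
    return Entry(fields[0], fields[1], fields[2], fields[3])


def entries_with_placeholder_metrics(lines):
    """Return the entries whose metrics equal the placeholder value."""
    flagged = []
    for line in lines:
        entry = parse_entry(line)
        if entry is not None and entry.metrics == PLACEHOLDER_METRICS:
            flagged.append(entry)
    return flagged


# Lines copied from the dump in this post.
dump = [
    "414",
    "NULL 0 NULL 0",
    "Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Joined [4a 6f 69 6e 65 64 ]",
    "ｱ 1 0,255,0,255,0,0,0,0,0,0 NULL 3 0 0 # ｱ [ff71 ]x",
    "ヴ 1 0,255,0,255,0,0,0,0,0,0 NULL 28 0 0 # ヴ [30f4 ]x",
]
for entry in entries_with_placeholder_metrics(dump):
    print(entry.unichar, entry.script, entry.metrics)
```

Running this on the unicharset before and after set_unicharset_properties would show which entries the tool actually updated.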

Thanks!

Shreeshrii commented 7 years ago

@theraysmith Are Halfwidth katakana included in your new Japanese training?

hoangtocdo90 commented 7 years ago

@Shreeshrii I think half-width katakana are not included. @theraysmith sir, can you tell me how to get the glyph_metrics in the unicharset?

Shreeshrii commented 7 years ago

@hoangtocdo90 Please see https://github.com/tesseract-ocr/langdata/issues/81#issuecomment-320821042 and reply to Ray's questions there.

tesseract-ocr / tesseract

Can I add a new character to the unicharset file if it does not appear in the langdata folder? #1046