tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Can I add new characters to the unicharset file if they do not appear in the langdata folder? #1046

Closed hoangtocdo90 closed 7 years ago

hoangtocdo90 commented 7 years ago

Hi guys! I'm trying to train Tesseract for Japanese. Japanese has several types of characters. My case concerns the half-width and full-width forms in the katakana table.

Half-width katakana example: ｱｲｳｴｵ ｶｷｸｹｺ
Full-width katakana example: アイウエオ カキクケコ

They look very similar, somewhat like uppercase and lowercase, but they are different characters. When the input contains half-width katakana, Tesseract either fails to recognize it or sometimes outputs full-width katakana instead.
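To make the distinction concrete, here is a minimal Python sketch (standard library only) showing that the two forms are separate Unicode code points, and that NFKC compatibility normalization folds the half-width forms into the full-width ones. The helper names `describe` and `fold_width` are made up for this illustration. Whether any Tesseract training tool applies such a normalization is not established in this thread; the sketch only shows what the normalization itself does.

```python
import unicodedata

halfwidth = "\uff71\uff72\uff73\uff74\uff75"  # half-width a, i, u, e, o
fullwidth = "\u30a2\u30a4\u30a6\u30a8\u30aa"  # full-width a, i, u, e, o


def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


def fold_width(text):
    """Fold half-width forms into full-width equivalents via NFKC."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # First character of each string: same letter, different code points.
    for pair in describe(halfwidth[:1] + fullwidth[:1]):
        print(pair)
    print(halfwidth == fullwidth)              # compared as raw strings
    print(fold_width(halfwidth) == fullwidth)  # compared after NFKC folding
```

If some step in a pipeline normalized the training text or ground truth this way, the half-width characters would no longer be distinguishable from the full-width ones.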

I tried using text2image to generate the images and box files and then ran lstm.train, but I have a problem with the unicharset:

set_unicharset_properties -U unicharset -O unicharset -X jpn.xheights --script_dir=./langdata

I checked langdata/Katakana.unicharset and it does not contain any half-width katakana symbols. Because of this, I can't produce a unicharset file with all the fields set to the right values, like in this example

This is the unicharset file I got from running this command:

unicharset_extractor jpn.msgothic.exp18.box jpn.msgothic.exp32.box jpn.msgothic.exp48.box jpn.msgothicb.exp18.box

414
NULL 0 NULL 0
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Joined [4a 6f 69 6e 65 64 ]
|Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Broken
ｱ 1 0,255,0,255,0,0,0,0,0,0 NULL 3 0 0 # ｱ [ff71 ]x
ｳ 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # ｳ [ff73 ]x
ﾄ 1 0,255,0,255,0,0,0,0,0,0 NULL 5 0 0 # ﾄ [ff84 ]x
ﾌ 1 0,255,0,255,0,0,0,0,0,0 NULL 6 0 0 # ﾌ [ff8c ]x
ﾟ 1 0,255,0,255,0,0,0,0,0,0 NULL 7 0 0 # ﾟ [ff9f ]x
ｯ 1 0,255,0,255,0,0,0,0,0,0 NULL 8 0 0 # ｯ [ff6f ]x
ﾏ 1 0,255,0,255,0,0,0,0,0,0 NULL 9 0 0 # ﾏ [ff8f ]x
ﾙ 1 0,255,0,255,0,0,0,0,0,0 NULL 10 0 0 # ﾙ [ff99 ]x
ﾊ 1 0,255,0,255,0,0,0,0,0,0 NULL 11 0 0 # ﾊ [ff8a ]x
ﾁ 1 0,255,0,255,0,0,0,0,0,0 NULL 12 0 0 # ﾁ [ff81 ]x
ｼ 1 0,255,0,255,0,0,0,0,0,0 NULL 13 0 0 # ｼ [ff7c ]x
ｶ 1 0,255,0,255,0,0,0,0,0,0 NULL 14 0 0 # ｶ [ff76 ]x
ｵ 1 0,255,0,255,0,0,0,0,0,0 NULL 15 0 0 # ｵ [ff75 ]x
ｴ 1 0,255,0,255,0,0,0,0,0,0 NULL 16 0 0 # ｴ [ff74 ]x
ﾗ 1 0,255,0,255,0,0,0,0,0,0 NULL 17 0 0 # ﾗ [ff97 ]x
ﾝ 1 0,255,0,255,0,0,0,0,0,0 NULL 18 0 0 # ﾝ [ff9d ]x
ﾋ 1 0,255,0,255,0,0,0,0,0,0 NULL 19 0 0 # ﾋ [ff8b ]x
ﾞ 1 0,255,0,255,0,0,0,0,0,0 NULL 20 0 0 # ﾞ [ff9e ]x
ﾕ 1 0,255,0,255,0,0,0,0,0,0 NULL 21 0 0 # ﾕ [ff95 ]x
ｰ 1 0,255,0,255,0,0,0,0,0,0 NULL 22 0 0 # ｰ [ff70 ]x
ｻ 1 0,255,0,255,0,0,0,0,0,0 NULL 23 0 0 # ｻ [ff7b ]x
ｫ 1 0,255,0,255,0,0,0,0,0,0 NULL 24 0 0 # ｫ [ff6b ]x
ｽ 1 0,255,0,255,0,0,0,0,0,0 NULL 25 0 0 # ｽ [ff7d ]x
ﾃ 1 0,255,0,255,0,0,0,0,0,0 NULL 26 0 0 # ﾃ [ff83 ]x
ｨ 1 0,255,0,255,0,0,0,0,0,0 NULL 27 0 0 # ｨ [ff68 ]x
ヴ 1 0,255,0,255,0,0,0,0,0,0 NULL 28 0 0 # ヴ [30f4 ]x
ｮ 1 0,255,0,255,0,0,0,0,0,0 NULL 29 0 0 # ｮ [ff6e ]x
ｪ 1 0,255,0,255,0,0,0,0,0,0 NULL 30 0 0 # ｪ [ff6a ]x
ﾉ 1 0,255,0,255,0,0,0,0,0,0 NULL 31 0 0 # ﾉ [ff89 ]x
ﾎ 1 0,255,0,255,0,0,0,0,0,0 NULL 32 0 0 # ﾎ [ff8e ]x
ﾔ 1 0,255,0,255,0,0,0,0,0,0 NULL 33 0 0 # ﾔ [ff94 ]x
ﾘ 1 0,255,0,255,0,0,0,0,0,0 NULL 35 0 0 # ﾘ [ff98 ]x
ﾈ 1 0,255,0,255,0,0,0,0,0,0 NULL 36 0 0 # ﾈ [ff88 ]x
ｲ 1 0,255,0,255,0,0,0,0,0,0 NULL 37 0 0 # ｲ [ff72 ]x
ﾍ 1 0,255,0,255,0,0,0,0,0,0 NULL 38 0 0 # ﾍ [ff8d ]x
ｸ 1 0,255,0,255,0,0,0,0,0,0 NULL 39 0 0 # ｸ [ff78 ]x
ﾀ 1 0,255,0,255,0,0,0,0,0,0 NULL 40 0 0 # ﾀ [ff80 ]x
ﾆ 1 0,255,0,255,0,0,0,0,0,0 NULL 41 0 0 # ﾆ [ff86 ]x
ｹ 1 0,255,0,255,0,0,0,0,0,0 NULL 42 0 0 # ｹ [ff79 ]x
ｺ 1 0,255,0,255,0,0,0,0,0,0 NULL 43 0 0 # ｺ [ff7a ]x
ﾅ 1 0,255,0,255,0,0,0,0,0,0 NULL 44 0 0 # ﾅ [ff85 ]x
ﾛ 1 0,255,0,255,0,0,0,0,0,0 NULL 45 0 0 # ﾛ [ff9b ]x
ﾒ 1 0,255,0,255,0,0,0,0,0,0 NULL 46 0 0 # ﾒ [ff92 ]x
ｿ 1 0,255,0,255,0,0,0,0,0,0 NULL 47 0 0 # ｿ [ff7f ]x
ﾐ 1 0,255,0,255,0,0,0,0,0,0 NULL 48 0 0 # ﾐ [ff90 ]x
ｾ 1 0,255,0,255,0,0,0,0,0,0 NULL 49 0 0 # ｾ [ff7e ]x
ｷ 1 0,255,0,255,0,0,0,0,0,0 NULL 50 0 0 # ｷ [ff77 ]x
ﾜ 1 0,255,0,255,0,0,0,0,0,0 NULL 51 0 0 # ﾜ [ff9c ]x
ﾚ 1 0,255,0,255,0,0,0,0,0,0 NULL 52 0 0 # ﾚ [ff9a ]x
ｬ 1 0,255,0,255,0,0,0,0,0,0 NULL 53 0 0 # ｬ [ff6c ]x
ｭ 1 0,255,0,255,0,0,0,0,0,0 NULL 54 0 0 # ｭ [ff6d ]x
ﾓ 1 0,255,0,255,0,0,0,0,0,0 NULL 60 0 0 # ﾓ [ff93 ]x
ﾑ 1 0,255,0,255,0,0,0,0,0,0 NULL 61 0 0 # ﾑ [ff91 ]x
ｧ 1 0,255,0,255,0,0,0,0,0,0 NULL 62 0 0 # ｧ [ff67 ]x
ｩ 1 0,255,0,255,0,0,0,0,0,0 NULL 63 0 0 # ｩ [ff69 ]x
ﾂ 1 0,255,0,255,0,0,0,0,0,0 NULL 64 0 0 # ﾂ [ff82 ]x
ﾖ 1 0,255,0,255,0,0,0,0,0,0 NULL 65 0 0 # ﾖ [ff96 ]x
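The manual check of langdata/Katakana.unicharset mentioned earlier could be scripted roughly as follows. This is only a sketch. It assumes one entry per line with the symbol as the first space-separated field, as in the listing above, and the function name `halfwidth_katakana_entries` is invented for this example.

```python
import unicodedata


def halfwidth_katakana_entries(unicharset_text):
    """Return the symbols of entries that are half-width katakana.

    Assumption: one entry per line, symbol first, fields separated
    by single spaces.
    """
    found = []
    for line in unicharset_text.splitlines():
        symbol = line.split(" ")[0]
        if len(symbol) != 1:
            continue  # skips the count line, NULL, Joined, |Broken|0|1, blanks
        # Unicode names of these characters start with "HALFWIDTH KATAKANA".
        if unicodedata.name(symbol, "").startswith("HALFWIDTH KATAKANA"):
            found.append(symbol)
    return found


# Tiny inline sample: one half-width entry and one full-width entry.
sample = (
    "4\n"
    "NULL 0 NULL 0\n"
    "\uff71 1 0,255,0,255,0,0,0,0,0,0 NULL 3 0 0 # \uff71 [ff71 ]x\n"
    "\u30f4 1 0,255,0,255,0,0,0,0,0,0 NULL 28 0 0 # \u30f4 [30f4 ]x\n"
)
print(halfwidth_katakana_entries(sample))
```

To run it on a real file, read the file as UTF-8 text (for example `open(path, encoding="utf-8").read()`) and pass the result to the function. An empty list would match the observation that the script file has no half-width katakana.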

Thanks!

Shreeshrii commented 7 years ago

@theraysmith Are Halfwidth katakana included in your new Japanese training?

hoangtocdo90 commented 7 years ago

@Shreeshrii I think half-width katakana are not included. @theraysmith Sir, can you tell me how to get the glyph_metrics in the unicharset?
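For readers unfamiliar with the field: glyph_metrics is the third space-separated field of a unicharset entry, a list of ten comma-separated integers. The names used below (min_bottom, max_bottom, min_top, max_top, min_width, max_width, min_bearing, max_bearing, min_advance, max_advance) follow the description in Tesseract's unicharset documentation as I understand it, so treat them as an assumption. The sketch only reads the values from a line and does not compute them from fonts. `Entry` and `parse_entry` are hypothetical names.

```python
from typing import Dict, NamedTuple

# Assumed field order, per the unicharset format documentation.
METRIC_NAMES = (
    "min_bottom", "max_bottom", "min_top", "max_top",
    "min_width", "max_width", "min_bearing", "max_bearing",
    "min_advance", "max_advance",
)


class Entry(NamedTuple):
    symbol: str
    properties: str
    glyph_metrics: Dict[str, int]


def parse_entry(line):
    """Parse the first three fields of one unicharset entry line."""
    symbol, properties, metrics = line.split(" ")[:3]
    values = [int(v) for v in metrics.split(",")]
    if len(values) != len(METRIC_NAMES):
        raise ValueError(f"expected 10 metrics, got {len(values)}")
    return Entry(symbol, properties, dict(zip(METRIC_NAMES, values)))


# One entry from the file posted in this issue (half-width katakana A).
entry = parse_entry(
    "\uff71 1 0,255,0,255,0,0,0,0,0,0 NULL 3 0 0 # \uff71 [ff71 ]x"
)
print(entry.glyph_metrics)
```

Every entry in the posted file carries the same values, 0,255,0,255,0,0,0,0,0,0, which suggests they are placeholders rather than measured metrics.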

Shreeshrii commented 7 years ago

@hoangtocdo90 Please see https://github.com/tesseract-ocr/langdata/issues/81#issuecomment-320821042 and reply to Ray's questions there.