tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

How to Add or Edit [script].unicharset in langdata folder? #99

Open sethleech opened 6 years ago

sethleech commented 6 years ago

How to Add or Edit [script].unicharset in langdata folder?

Dear all,

I have been trying Tesseract recently, and it is really a very good product. Is there a tutorial or a set of steps on how to add or edit [script].unicharset, for example han.unicharset?

I want to add missing characters (Unicode code points) from CJK Extensions B, C, D, E and F:

  • CJK Unified Ideographs Extension B: U+20000–U+2A6D6
  • CJK Unified Ideographs Extension C: U+2A700–U+2B734
  • CJK Unified Ideographs Extension D: U+2B740–U+2B81D
  • CJK Unified Ideographs Extension E: U+2B820–U+2CEA1
  • CJK Unified Ideographs Extension F: U+2CEB0–U+2EBE0
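To make the ranges concrete, here is a minimal Python sketch that reports which extension a character belongs to and which extension characters are absent from a known set. The function names are made up for illustration, and the bounds are simply the ones listed in this post.

```python
# Inclusive code point bounds, copied from the list in this post.
CJK_EXTENSIONS = {
    "B": (0x20000, 0x2A6D6),
    "C": (0x2A700, 0x2B734),
    "D": (0x2B740, 0x2B81D),
    "E": (0x2B820, 0x2CEA1),
    "F": (0x2CEB0, 0x2EBE0),
}


def cjk_extension(ch):
    """Return the extension letter for a single character, or None."""
    cp = ord(ch)
    for name, (first, last) in CJK_EXTENSIONS.items():
        if first <= cp <= last:
            return name
    return None


def missing_extension_chars(text, known_chars):
    """Collect Extension B-F characters in `text` that are not in `known_chars`."""
    return sorted({ch for ch in text
                   if cjk_extension(ch) is not None and ch not in known_chars})
```

For example, `cjk_extension("𥮗")` classifies U+25B97, the character used in the attempts described in this post.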

For reference, this is what I tried when training Tesseract.

1st try:

unicharset_extractor:
tesseract-ocr/unicharset_extractor -D [lang] [lang]/[lang].[font].exp0.box

Output is unicharset:
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x

set_unicharset_properties:
tesseract-ocr/set_unicharset_properties -U unicharset -O [lang].unicharset --script_dir=langdata/[lang] --X langdata/[lang]/han.xheights

Warning: properties incomplete for index 4 = 𥮗

Output is [lang].unicharset (not changed):
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x
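The unchanged entry can be recognised by its placeholder metrics and NULL script. Below is a rough Python sketch for listing such entries in a unicharset file. It assumes the layout seen in this post (entry count on the first line, then one whitespace-separated entry per line); it is not the check Tesseract itself performs before printing the warning, and the function name is hypothetical.

```python
# Placeholder metrics as they appear in the unchanged output above.
DEFAULT_METRICS = "0,255,0,255,0,0,0,0,0,0"


def placeholder_entries(unicharset_text):
    """Return (index, character) pairs for entries that still have the
    placeholder metrics and a NULL script."""
    flagged = []
    # Skip the first line, which holds the entry count.
    for index, line in enumerate(unicharset_text.splitlines()[1:]):
        fields = line.split()
        if (len(fields) >= 4
                and fields[2] == DEFAULT_METRICS
                and fields[3] == "NULL"):
            flagged.append((index, fields[0]))
    return flagged
```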

2nd try:

I edited the file langdata/han.unicharset:

  1. Changed the entry count on line 0 from 23514 to 23515.
  2. Added a new line at the end of the file:
     𥮗 1 61,64,255,255,188,200,6,11,205,224 Han 23514 0 23514 𥮗 # 𥮗 [25b97 ]x
     The values 61,64,255,255,188,200,6,11,205,224 were copied from another line (for example, line 67).

unicharset_extractor:
tesseract-ocr/unicharset_extractor -D [lang] [lang]/[lang].[font].exp0.box

Output is unicharset:
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x

set_unicharset_properties (no warning this time):
tesseract-ocr/set_unicharset_properties -U unicharset -O [lang].unicharset --script_dir=langdata/[lang] --X langdata/[lang]/han.xheights

Output is [lang].unicharset (changed):
𥮗 1 61,64,255,255,188,200,6,11,205,224 Han 4 0 4 𥮗 # 𥮗 [25b97 ]x
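The manual edit from the 2nd try could be scripted roughly as follows. This is only a sketch of what I did by hand, not an official procedure: the function name and its defaults are hypothetical, the entry layout is copied from the line I added, and metrics borrowed from another character do not describe the real glyph.

```python
def append_entry(unicharset_text, char, template_metrics,
                 script="Han", properties="1"):
    """Append one entry to a [script].unicharset text and bump the count."""
    lines = unicharset_text.splitlines()
    if any(line.startswith(char + " ") for line in lines[1:]):
        raise ValueError("character is already present")
    count = int(lines[0])
    # As in the hand edit, the new entry's other_case and mirror fields
    # are set to the old count (23514 in my case).
    new_index = count
    entry = (f"{char} {properties} {template_metrics} {script} "
             f"{new_index} 0 {new_index} {char} "
             f"# {char} [{ord(char):x} ]x")
    lines[0] = str(count + 1)
    lines.append(entry)
    return "\n".join(lines) + "\n"
```

Called with the metrics copied from line 67 and a file whose first line is 23514, this reproduces the line I added by hand.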

I found out:

  1. The [script].unicharset file is officially supported.
  2. Each entry has these properties: 'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'
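As an illustration of those eight properties, here is a small Python sketch that splits one entry line into named fields. It assumes the fields are whitespace-separated and appear in the order listed above, with everything after the eighth field treated as a comment. The class and function names are hypothetical, and 'properties' is kept as raw text because I have not confirmed how it is encoded.

```python
from typing import List, NamedTuple


class UnicharsetEntry(NamedTuple):
    character: str
    properties: str          # kept as raw text; encoding not confirmed
    glyph_metrics: List[int]
    script: str
    other_case: int
    direction: int
    mirror: int
    normed_form: str


def parse_entry(line):
    """Parse one full unicharset entry line into its eight named fields."""
    tokens = line.split()
    if len(tokens) < 8:
        raise ValueError("expected at least 8 whitespace-separated fields")
    char, props, metrics, script, other_case, direction, mirror, normed = tokens[:8]
    return UnicharsetEntry(
        character=char,
        properties=props,
        glyph_metrics=[int(v) for v in metrics.split(",")],
        script=script,
        other_case=int(other_case),
        direction=int(direction),
        mirror=int(mirror),
        normed_form=normed,
    )
```

Entries with fewer than eight fields are rejected rather than guessed at.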

How can I get the 'glyph_metrics' data from a font, or from several fonts?

Thank you in advance.

Regards,

Shreeshrii commented 6 years ago

Based on comments by @theraysmith, the other properties are not required for LSTM training.

(In reply to sethleech's post of 24-Oct-2017, 12:34 PM.)

sethleech commented 6 years ago

My project runs on an Android device. At present, Tesseract 4.0 can't be used on Android because of build issues related to "AVX" and "SSE", so I can only use Tesseract 3.05.01.

Is there any information on this, please?

baishuangcheng commented 4 years ago

I have the same question.